Abstract

Rater behavior in essay grading can be viewed as a signal-detection task, in that raters attempt to 
discriminate between latent classes of essays, with the latent classes being defined by a scoring 
rubric. The present report examines basic aspects of an approach to constructed-response (CR) 
scoring via a latent-class signal-detection model. The model provides a psychological framework 
for CR scoring and includes rater parameters with a clear cognitive basis. Simulations are used to 
examine how well rater parameters and latent-class sizes are recovered as well as the accuracy of 
classification. The relation of rater parameters to agreement statistics and classification accuracy 
is examined. The effects of using a balanced, incomplete block design are compared to those for 
a fully crossed design. The model is applied to several ETS datasets. 

Key words: Constructed responses, rater effects, signal detection theory, latent class models, 
classification, agreement, incomplete block design

Introduction


Essays and other constructed-response (CR) items must be scored by raters. The use of 
raters to score CR items raises questions about how raters perform the task, an understanding of 
which in turn is important for the choice of a model of rater behavior. One approach is to view 
raters as attempting to classify each essay into a latent category, where the latent categories are 
defined by a scoring rubric. For example, a 1-6 scoring rubric, as used in the SAT®, GRE®, and 
Praxis™, can be viewed as defining six latent categories of essays, with the task of raters being 
to determine to which of the six categories each essay belongs. When viewed in this way, the 
task becomes one of signal detection, in that raters attempt to discriminate between latent 
categories of items. This suggests the use of a latent-class version of signal-detection theory 
(SDT) as a model of rater behavior. The approach offers a psychological framework for 
understanding CR scoring and includes rater parameters that have a clear cognitive basis. Up to 
this point, latent-class SDT models have been used primarily in medical diagnosis (see DeCarlo, 
2002). However, the approach recently has been used in education and in particular as a model of 
rater behavior in essay scoring (DeCarlo, 2005). The present report examines this approach in 
more detail. 

An immediate benefit of an approach to CR scoring via SDT is that it clarifies that the 
scores assigned by raters reflect two basic aspects of the task, a perceptual aspect and a decision 
aspect. This is illustrated in Figure 1. 



Figure 1. A representation of signal detection theory.

The perceptual aspect of the task refers to the view that, for holistic scoring, raters base
their scores in part on their perception of the overall quality of an essay. A basic assumption in 
SDT is that the perceptions can be viewed as being realizations of a continuous, random variable 
with a specified probability distribution, such as the normal or logistic (other distributions can be 
used through the use of different link functions; see DeCarlo, 1998). In particular, it is assumed 
that there is a probability distribution for each latent class of essay, with a different location for 
each class, as shown in Figure 1. That is, Figure 1 shows that, for a 1-4 scoring rubric, it is 
assumed that raters attempt to discriminate between four latent classes of essays. Additionally, 
the perceptions of the quality of essays from a particular latent class can be represented by a 
probability distribution, with the result of four distributions, one for each latent class, with 
different locations. 

Of basic interest in SDT is a rater’s ability to discriminate between the latent classes, as 
measured by a discrimination parameter d, which is interpreted in SDT as a measure of the 
distance between the underlying perceptual distributions; a higher value of d indicates better 
discrimination and distributions that are further apart. In the version of SDT considered here, 
referred to as an equal spacing SDT model (DeCarlo, 2002, 2005), it is assumed that the raters 
perceive the latent classes as being equally spaced, and so the distance between perceptual 
distributions is the same for adjacent distributions, which gives distances of d, 2d, 3d, and so on,
as shown in Figure 1. Note that the equal spacing is in the raters’ perceptions, and not the latent 
classes, which are only assumed to be ordinal. As shown below, the equal-distance restriction is 
implemented in the model by scoring the latent classes as 0, 1, 2, and so on. 

The decision aspect of the task has to do with a rater’s use of the response categories, that 
is, what a rater considers to be a Category 4 versus a Category 3, for example. In SDT, a rater’s 
category usage is reflected by his or her use of response criteria, $c_k$, which delineate the K
categories, as shown in Figure 1. It is widely recognized that some raters tend to assign high 
scores (leniency), whereas others tend to assign low scores (strictness); in terms of SDT, this 
simply reflects the raters’ arbitrary use of response criteria, which are lower (i.e., further to the 
left) for lenient raters and higher for strict raters. The locations of the response criteria also 
reflect any and all other peculiarities in the rater’s response usage, such as avoiding end 
categories or spacing the categories unequally. Thus, SDT separates perceptual aspects of the
task (a rater’s ability to discriminate between the latent categories) from decision aspects (a
rater’s use of response criteria).

Another way to represent the model is shown in Figure 2, which uses a diagram similar to 
that used in structural equation modeling (e.g., see Kline, 2005). The observed responses, $Y_j$,
consist of ratings, such as from 1-6. As is well known in statistics and psychometrics, models
with ordinal responses can be motivated by assuming a continuous underlying variable (e.g.,
Agresti, 2002), which is shown for each rater j as $\Psi_j$ in Figure 2. In the SDT approach, $\Psi_j$
represents a rater’s perception of the overall quality of an essay, as shown in Figure 1. As noted
above, it is assumed that raters arrive at their observed responses by using their perceptions in
conjunction with response criteria, shown as c in Figure 2. The arrows from $\Psi_j$ to $Y_j$ (strictly, to
the probability of $Y_j$ and not the observed $Y_j$; the diagram is simplified) are curved to indicate
that the relation between the mean of $\Psi_j$ and the response probabilities is nonlinear. As noted
above and represented in Figure 2, the mean of the $\Psi_j$ distribution is shifted by $d_j$ across the
latent classes, which are denoted here as $X^{\#}$ (i.e., $X^{\#}$ is used here to denote a latent categorical
variable, whereas $X^*$ is commonly used in statistics and econometrics to denote a latent
continuous variable).



Figure 2. A structural equation-like representation of latent-class signal-detection theory.

The latent-class SDT model can be written as follows. Consider the situation where J
raters examine N cases (e.g., essays) and assign a discrete score k to each case, where $1 \le k \le K$
and K is the number of response categories. For the equal-distance version of the SDT model, the
model is


$$p(Y_j \le k \mid X^{\#} = x^{\#}) = F(c_{jk} - d_j x^{\#}), \tag{1}$$

where $Y_j$ is the response variable for rater j (e.g., a 1-6 response), $X^{\#}$ is a latent categorical
variable with values $x^{\#}$ from 0 to K − 1 (note that this particular scoring implements the
equal-distance restriction), $c_{jk}$ are K − 1 strictly ordered response criteria for the jth rater and kth
response category, $d_j$ is the discrimination parameter for the jth rater, and F is a cumulative
distribution function; the logistic cumulative distribution function is used here.
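
As an illustration of Equation 1, the following minimal Python sketch computes one rater's cumulative probabilities with a logistic F and differences them to obtain the probability of each response category. The function names and the parameter values (d = 2, four categories, criteria at 1, 3, and 5) are assumptions chosen for illustration; they are not estimates from this report.

```python
import math

def logistic_cdf(z: float) -> float:
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(criteria, d, x):
    """Response probabilities p(Y = k | latent class x) for k = 1..K.

    criteria: the K-1 strictly ordered response criteria for one rater
    d:        that rater's discrimination parameter
    x:        latent class score (0, 1, ..., K-1)
    """
    # Cumulative probabilities p(Y <= k | x) from Equation 1; the last one is 1.
    cum = [logistic_cdf(c - d * x) for c in criteria] + [1.0]
    # Difference adjacent cumulative probabilities to get category probabilities.
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Illustrative (assumed) values: d = 2, K = 4, criteria at 1, 3, 5.
probs = category_probs([1.0, 3.0, 5.0], d=2.0, x=1)
print([round(p, 3) for p in probs])
```

For an essay from the second latent class (x = 1), the most probable response under these assumed values is a score of 2, with the remaining probability spread over the neighboring categories.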

To complete the model, Equation 1 is incorporated into a restricted latent class model, as 
shown in DeCarlo (2002, 2005). A latent-class model is a model for the probability of the 
response patterns $(k_1, k_2, \ldots, k_J)$ for the J raters and can be written as

$$p(Y_1 = k_1, \ldots, Y_J = k_J) = \sum_{x^{\#}} p(X^{\#} = x^{\#})\, p(Y_1 = k_1, \ldots, Y_J = k_J \mid X^{\#} = x^{\#}), \tag{2}$$

where the summation is over the latent classes $x^{\#}$. With an assumption of local independence, the
second term on the right becomes

$$p(Y_1 = k_1, \ldots, Y_J = k_J \mid X^{\#} = x^{\#}) = \prod_{j} p(Y_j = k_j \mid X^{\#} = x^{\#}), \tag{3}$$

where the product is over the J raters. The latent-class SDT model of Equation 1 is then used for 
the product on the right in Equation 3 by differencing the cumulative probabilities to get 
response probabilities, as done for item-response theory models such as the graded-response 
model (Samejima, 1969). Equations 1 and 3 are then incorporated into Equation 2 to complete 
the model. Note that, for the version of the latent-class SDT model considered here, which has K 
ordinal response categories for K latent classes, a minimum of three raters is generally needed in 
order for the model to be identified; this is not an issue for the large-scale assessments examined 
here, because many raters are used; yet, it is relevant when pooled data are analyzed (see the 
section on the second writing test below). 
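
The way Equations 1 through 3 combine can be sketched in a few lines of Python. The sketch below computes the probability of one response pattern and, via Bayes' rule, the posterior probability of each latent class given that pattern. All names and numerical values (three raters, three categories, d = 2, class sizes of .25, .50, and .25) are hypothetical and serve only to illustrate the structure of the model.

```python
import math

def category_probs(criteria, d, x):
    """p(Y = k | latent class x), k = 1..K, from differenced logistic cumulative probabilities."""
    cum = [1.0 / (1.0 + math.exp(-(c - d * x))) for c in criteria] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def pattern_probability(pattern, class_sizes, criteria, d):
    """Equation 2 for one response pattern, plus posterior class probabilities.

    pattern:     responses (1..K), one per rater
    class_sizes: latent-class sizes for classes scored 0..K-1
    criteria:    per-rater lists of K-1 criteria
    d:           per-rater discrimination parameters
    """
    joint = []
    for x, size in enumerate(class_sizes):
        # Equation 3: local independence gives a product over raters.
        cond = 1.0
        for k, c_j, d_j in zip(pattern, criteria, d):
            cond *= category_probs(c_j, d_j, x)[k - 1]
        joint.append(size * cond)
    total = sum(joint)                      # Equation 2: sum over latent classes
    posterior = [j / total for j in joint]  # Bayes' rule
    return total, posterior

# Hypothetical setup: 3 raters, K = 3, d = 2, intersection-point criteria.
sizes = [0.25, 0.50, 0.25]
crit = [[1.0, 3.0]] * 3
p, post = pattern_probability([2, 2, 3], sizes, crit, [2.0, 2.0, 2.0])
print(round(p, 4), [round(q, 3) for q in post])
```

Classifying a case into the class with the largest posterior probability is the rule used for the classification results reported later.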

As has been noted previously (DeCarlo, 2002, 2005), from a statistical perspective, the 
latent-class SDT model is closely related to several other models discussed in psychometrics. For
example, it can be viewed as a discretized version of the graded-response model (see Heinen,
1996) and is also related to the D-factor models of Vermunt and Magidson (2007). The 
difference is that D-factor models use adjacent category logits, whereas SDT uses cumulative 
links, because of the motivation in terms of underlying distributions. Indeed, it should be clear 
from Figure 2 that the latent-class SDT model is a type of factor analysis model, albeit with a 
discrete factor. Previous research (DeCarlo, 2002, 2005) has compared the latent-class SDT 
model with D-factor models and item-response theory models. 

The present report examines an approach to CR scoring via latent-class SDT. A basic 
goal is to obtain information, through simulations, about how well the parameters are recovered 
and how accurate the classifications are. Little or no information about this is currently available.
Also investigated are the relation of rater parameters to agreement statistics and the relation of 
rater discrimination to classification accuracy. 

For large-scale assessments, incomplete designs are a necessity, because there are a large 
number of essays; thus, not all of the raters can score all of the essays. Instead, each essay is 
graded by a subset of raters, typically 2. This makes it possible for a relatively small number of 
raters to score a relatively large number of essays. For example, if a balanced, incomplete block 
(BIB) design is used, with each essay scored by 2 raters, a total of 1,080 essays can be scored by 
10 raters, with each rater scoring 216 essays. The present report examines incomplete designs 
and compares the results to those obtained with complete (fully crossed) designs. Applications to 
real-world data are also presented. 

Simulated Data: Fully Crossed Design 

First examined are fully crossed designs, which provide a useful reference point for the 
incomplete designs examined below. Estimation and classification are examined using a range of 
values for the rater parameters that is consistent with that found for real-world data. 

Methods 

The simulated data were generated using SAS software macros written by the author 
(used in DeCarlo, 2005, and modified as needed for the current studies). Data for 10 raters 
discriminating between six latent classes by giving responses from 1 to 6 were simulated. The
latent class sizes were chosen to approximate a normal distribution (see Appendix A), which is
consistent with the results found for the exams analyzed below. A range of values of d from 2-5
was used, which covers a range of detection from moderate to excellent (for the logistic model)
and is consistent with that found for real-world data. For example, for the large-scale assessment 
examined below, the values of d for 44 raters ranged from 1.8-5.3. One also has to make 
decisions about the location of the criteria for the different values of d. A general approach, used 
here, is to locate the criteria at the intersection points of adjacent distributions, which has the 
convenient property that the relative locations of the criteria remain the same as d varies; some 
conditions where the criteria are not at the intersection points are also examined below. Relative 
to d, this means that the first through last criteria are located at ½d, 1½d, 2½d, and so on. That is,
the intersection point of two adjacent symmetrical distributions lies midway between
their locations. So, for example, for six latent classes and a d of
2, the six distributions will be at 0, 2, 4, 6, 8, and 10 and the five response criteria will be at 1, 3, 
5, 7, and 9, and similarly for other values of d. A sample size of 1,080 was used for all 
conditions; the size of 1,080 was used instead of simply 1,000 because the incomplete design 
examined below is fully balanced for 10 raters with a sample size of 1,080. Each condition 
consisted of 100 replications. 
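
The placement of the criteria at the intersection points can be checked with a short Python sketch. The function name is an assumption for illustration; the values reproduce the d = 2 example given in the text.

```python
def intersection_point_design(d, n_classes):
    """Locations of the perceptual distributions and intersection-point criteria.

    Under equal spacing, class x (scored 0..K-1) is located at d * x, and each
    criterion sits midway between two adjacent locations.
    """
    locations = [d * x for x in range(n_classes)]
    criteria = [(lo + hi) / 2 for lo, hi in zip(locations, locations[1:])]
    return locations, criteria

locations, criteria = intersection_point_design(d=2, n_classes=6)
print(locations)  # [0, 2, 4, 6, 8, 10]
print(criteria)   # [1.0, 3.0, 5.0, 7.0, 9.0]
```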

Data generation consisted of three steps. First, values for the latent variable $X^{\#}$ (i.e., 0, 1,
2, ..., K − 1) were generated using a multinomial distribution, where the latent class sizes were
used as the probabilities for each latent-class category. Next, the generated values of $X^{\#}$ were
used in Equation 1 along with the population parameters $c_{jk}$ and $d_j$ to get cumulative response
probabilities for each rater and response category, using a logistic distribution for F. To generate
an observed response, the probabilities were compared to values obtained from a uniform
random variable generated on an interval from 0-1. If the value was less than or equal to the 
probability for the lowest response category, then a response of 1 was assigned; if it was greater 
than the probability for the lowest category, but less than or equal to the value for the second 
category, then a response of 2 was assigned, and so on. 
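
The three generation steps can be sketched in Python as follows. This is not the SAS macro used for the report; it is a minimal re-expression of the described procedure, and the function name, the seed handling, and the example parameter values (three raters, four categories, d = 3, illustrative class sizes) are all assumptions.

```python
import math
import random

def simulate_ratings(n_cases, class_sizes, criteria, d, seed=0):
    """Simulate latent classes and ratings following the three steps in the text.

    class_sizes: latent-class sizes for classes scored 0..K-1
    criteria:    per-rater lists of K-1 ordered criteria
    d:           per-rater discrimination parameters
    Returns (latent, ratings), with ratings[i][j] in 1..K.
    """
    rng = random.Random(seed)
    classes = range(len(class_sizes))
    # Step 1: draw each case's latent class from a multinomial distribution.
    latent = rng.choices(classes, weights=class_sizes, k=n_cases)
    ratings = []
    for x in latent:
        row = []
        for c_j, d_j in zip(criteria, d):
            # Step 2: cumulative response probabilities from Equation 1 (logistic F).
            cum = [1.0 / (1.0 + math.exp(-(c - d_j * x))) for c in c_j]
            # Step 3: compare a uniform(0, 1) draw with the cumulative probabilities.
            # The response is the first category whose cumulative probability is
            # at least as large as the draw; otherwise it is the top category K.
            u = rng.random()
            row.append(next((k for k, p in enumerate(cum, start=1) if u <= p),
                            len(cum) + 1))
        ratings.append(row)
    return latent, ratings

# Hypothetical small run: 3 raters, K = 4, d = 3, intersection-point criteria.
sizes = [0.2, 0.3, 0.3, 0.2]   # illustrative, not the report's class sizes
crit = [[1.5, 4.5, 7.5]] * 3
latent, ratings = simulate_ratings(1080, sizes, crit, [3.0] * 3, seed=1)
```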

Several software packages can be used to fit the latent-class SDT model, such as LEM 
(Vermunt, 1997), Latent Gold (Vermunt & Magidson, 2007), and Mplus (Muthén & Muthén,
2007). Some small simulations indicated that Latent Gold tended to have good performance for
the models considered here, and so it was used. In particular, a prerelease version of Latent Gold
(demo Version 4.5), made available to the author, was used; Version 4.5 allows one to use syntax
to specify a wide range of models, including the latent-class SDT model. Latent Gold uses the
expectation-maximization algorithm followed by the Newton-Raphson procedure to obtain
maximum likelihood estimates of the parameters (unless Bayes constants are used; see the 
incomplete simulation below). A SAS macro written by the author was used to generate 100 
input files for the Latent Gold analysis and also a DOS batch file, which was used to call Latent 
Gold repeatedly to perform the analysis. Other SAS macros stripped out information from the 
Latent Gold output for each replication, and the results were combined in a file for the remaining 
analyses. 

One complication that must be recognized is known as label switching (McLachlan & 
Peel, 2000). In the current context, label switching has to do with the coding of the latent 
categorical variable $X^{\#}$, in that it is arbitrary as to which class is assigned a value of zero. For
example, in some cases, the classes will be labeled as 0, 1, ..., K − 1, and in other cases as K − 1,
K − 2, ..., 0. The maximized log likelihood has the same value for the switched solution; the main
consequence for the latent class SDT model is that the sign of d is reversed, as is the order of the
latent classes. In addition, when label switching occurs, one has to add (K − 1) times d to the
obtained criteria estimates in order to obtain estimates of c. The SAS macro that stripped out and
summarized the data checked for label switching and adjusted the computations appropriately. 
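
The adjustment for label switching can be sketched as below. The report does not say how the macro detected a switched solution; treating a negative estimate of d as the signal is an assumption made here, as are the function name and the example values.

```python
def undo_label_switching(d_hat, criteria_hat, class_sizes_hat):
    """Map a label-switched solution back to the original class ordering.

    Assumption: a negative discrimination estimate signals that the classes
    were labeled K-1, ..., 0 instead of 0, ..., K-1. In that case the sign of
    d is flipped, (K-1) * d is added to each criterion, and the class sizes
    are reversed.
    """
    if d_hat >= 0:
        return d_hat, list(criteria_hat), list(class_sizes_hat)
    k_minus_1 = len(class_sizes_hat) - 1
    d = -d_hat
    criteria = [c + k_minus_1 * d for c in criteria_hat]
    return d, criteria, list(reversed(class_sizes_hat))

# Hypothetical switched solution for K = 4: the unswitched values would be
# d = 2, criteria (1, 3, 5), and class sizes (.1, .4, .3, .2).
d, criteria, sizes = undo_label_switching(-2.0, [-5.0, -3.0, -1.0],
                                          [0.2, 0.3, 0.4, 0.1])
print(d, criteria, sizes)
```

The adjustment follows from Equation 1: replacing the class score x by K − 1 − x leaves the argument of F unchanged only if d changes sign and each criterion is shifted by (K − 1) times d.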

Results 

Rater parameters and latent-class sizes. Appendix A presents, for the rater parameters 
and latent-class sizes, the population parameters, the mean parameter estimates, the bias, the 
(absolute) percent bias (the parameter estimate minus the population value, divided by the 
population value, times 100; the absolute percent bias is shown here because the direction is 
obvious from the sign of the bias), and the mean squared error (MSE) for fits of the model to the 
100 sets of simulated data. 
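
For concreteness, the summary measures can be computed as in the following Python sketch. It assumes that percent bias is computed from the mean estimate across replications; the function name and the four example estimates are made up for illustration.

```python
def recovery_summary(estimates, true_value):
    """Mean estimate, bias, absolute percent bias, and MSE across replications.

    Percent bias is undefined when the population value is zero.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    pct_bias = abs(bias / true_value) * 100
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return {"mean": mean_est, "bias": bias, "pct_bias": pct_bias, "mse": mse}

# Made-up estimates of d from four replications with a population value of 2.
summary = recovery_summary([2.1, 1.9, 2.2, 2.0], true_value=2.0)
print(summary)
```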

Tables A1-A3 show that estimation is excellent for values of d from 2-4. The bias and 
MSE for the rater parameters are small, with a percent bias of generally less than 1% for d and 
less than 2% for c; percent bias less than 5% is usually viewed as trivial, values from 5-10% as 
moderate, and over 10% as large (e.g., Flora & Curran, 2004); here, percent bias of 10% or less 
is viewed as acceptable. The MSE is small, being less than 0.10 for values of d from 2-4, and 
generally less than 0.50 for the c. Estimation of the latent class sizes is also quite good, with a 
percent bias of less than 2%.

For d = 5, there were problems with convergence, with convergence for only 48 out of
the 100 replications; Table A4 is based on the 48 cases where the program converged. It is not 
surprising to encounter estimation difficulties of this sort with larger values of d because the 
tables being analyzed become more sparse; that is, there tend to be more zero and low-count 
cells as d increases. In terms of a two-by-two table, for example, it should be apparent that the 
entries will concentrate more along the diagonal as d increases (and will all be on the diagonal 
for perfect discrimination). The fact that estimation problems arise with large values of the slope 
parameter (i.e., d) is well known in, for example, logistic regression (e.g., Hosmer & Lemeshow, 
1989; Rindskopf, 2002). 

Table A4 shows that, for the converged cases, estimation of d is very good, with a 
percent bias of 3% or less and a MSE of less than 0.10. With respect to the response criteria, the 
percent bias is larger but is generally less than 10%, except for the first criterion, which tends to 
have larger percent bias, around 12%. However, the MSEs for the response criteria are 
considerably larger, in the range of 10-20, which indicates that the estimates of c have large 
variability across replications. Estimation of the latent-class sizes is also problematic, with a 
small percent bias for the middle classes but large percent bias (> 20%) for the end classes. 

Tables A5 and A6 show examples where shifted criteria were used for ds of 2 and 3. In 
this case, the criteria for 2 of the 10 raters were shifted down from the intersection points 
locations by 2, the criteria for 2 other raters were shifted down by 1, the criteria for another 2 
raters were shifted up by 1, the criteria for another 2 raters were shifted up by 2, and the criteria 
for the remaining 2 raters were left at the intersection points. Tables A5 and A6 show that 
shifting the locations of the criteria had little effect on estimation, with the percent bias being 
below 5% for both the rater parameters and the latent class sizes. 

Standard errors. Appendix B presents results for the evaluation of the standard errors of 
d and the latent-class sizes (the response criteria are not of central interest; there are also some 
complexities with respect to evaluating the standard errors of c in a simulation because of label 
switching). Tables B1-B4 show that estimation of the standard errors of d is good, with a percent 
bias of 10% or less for values of d from 2-4; the standard errors of the latent class sizes also 
appear to be well recovered. For d of 5, the percent bias is larger, up to about 20% for the 
standard error of d. The percent bias is also larger for the latent-class sizes, particularly for the 
first and last classes, where it is around 90%. Table B4 shows that, for a d of 5, the bias is
consistently negative for the standard errors of both d and the latent class sizes, and so the
standard errors tend to be underestimated. Tables B5 and B6 show that, for the shifted criteria 
conditions, the bias for the standard errors is again small for values of d of 2 and 3. 

To summarize, Appendixes A and B show that estimation of d and its standard error is 
quite good for values of d in the range of 2-5, whereas the response criteria and latent-class sizes 
are accurately estimated for values of d in the range of 2-4 but are less well estimated for a value 
of d of 5. Overall, the results indicate that if one wishes to assess the performance of the raters, in 
which case d is of primary interest, then one can obtain a good idea of rater performance, in that 
d is accurately estimated for the range of values examined here, which is similar to the range 
found in practice. 

Classification. Table 1 shows the classification accuracy (proportion correctly classified) 
for values of d ranging from 2-5. $PC_{pred}$ is the predicted proportion correctly classified and is
obtained from the posterior probabilities (it is basically the average of the maximum posterior
probabilities across cases); this value is obtained when the model is fit and is therefore available
for both simulated and real-world data. In contrast, $PC_{obt}$ is only available in a simulation and is
the obtained proportion of cases that were actually correctly classified in the simulation, where
the cases are classified into the class with the maximum posterior probability. Table 1 also shows
lambda (see Dayton, 1998; DeCarlo, 2002), both predicted and obtained, and two measures of
association with the true latent classes, namely the Pearson correlation r and $\tau_b$. Lambda
adjusts the proportion correct using the largest latent class size,

$$\lambda = \frac{PC - \max p(X^{\#})}{1 - \max p(X^{\#})}, \tag{4}$$

and reflects the improvement in classification accuracy over and above simply classifying all of 
the cases into the class with the largest size. 
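
The classification measures can be sketched in Python as follows. The sketch computes the predicted proportion correct as the average maximum posterior probability, the obtained proportion correct by comparing modal classifications with known true classes, and lambda from Equation 4. The function name and the toy posterior probabilities are assumptions for illustration only.

```python
def classification_summary(posteriors, class_sizes, true_classes=None):
    """Predicted (and, if truth is known, obtained) proportion correct and lambda.

    posteriors:   per-case lists of posterior class probabilities
    class_sizes:  estimated latent-class sizes
    true_classes: true class index per case (available only in a simulation)
    """
    def lam(pc):
        # Equation 4: improvement over assigning every case to the largest class.
        largest = max(class_sizes)
        return (pc - largest) / (1 - largest)

    # Predicted PC: average of the maximum posterior probability across cases.
    pc_pred = sum(max(p) for p in posteriors) / len(posteriors)
    out = {"pc_pred": pc_pred, "lambda_pred": lam(pc_pred)}
    if true_classes is not None:
        # Obtained PC: classify each case into its modal class, compare with truth.
        assigned = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
        pc_obt = sum(a == t for a, t in zip(assigned, true_classes)) / len(assigned)
        out.update(pc_obt=pc_obt, lambda_obt=lam(pc_obt))
    return out

# Toy example with four cases and three classes (values are made up).
post = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6], [0.05, 0.9, 0.05]]
result = classification_summary(post, [0.25, 0.5, 0.25], true_classes=[0, 1, 1, 1])
print(result)
```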

Table 1 shows that the predicted proportion correctly classified ($PC_{pred}$) is .92 for a d of 2
and is over .98 for ds from 3-5. A high value of PC is expected because the accuracy of
classification increases with the number of raters, and 10 raters per essay is a relatively large
number (as compared below to an incomplete design with only 2 raters per essay). For values of
d from 2-4, the obtained proportion correctly classified ($PC_{obt}$) is close to the predicted value.
Table 1 also shows that $PC_{pred}$ overestimates $PC_{obt}$ (the difference is very small in this case), as
was also found by DeCarlo (2005). For a d of 5, the obtained PC is considerably smaller (.34);
this likely occurs in part because of poor estimation of the latent class sizes for a d of 5,
particularly for the end classes, as was noted above. The problem appears to be that in situations 
with large d (and small d, though not shown), one or more of the estimated latent-class sizes 
tends to zero or near zero, and so the classifications tend to be off by one class (or more). This 
can be shown by computing the proportion correctly classified within one class, which was .99 
or larger in every case, including d = 5. Note that, even with poor classification accuracy for a d 
of 5, the Pearson correlation (.93) and $\tau_b$ (.93) are high, and so the classifications still reflect
the order of the latent classes. 

Table 1

Proportion Correctly Classified and Correlations With True Latent Classes, Fully Crossed
Design

d      PC_pred   PC_obt   lambda_pred   lambda_obt   tau_b   r
Intersection-point criteria
2      .919      .916     .891          .887         .978    .960
3      .986      .985     .981          .979         .996    .993
4      .998      .988     .997          .984         .999    .999
5^a    .986      .335     .982          .104         .931    .932
Shifted criteria
2      .911      .908     .880          .876         .975    .957
3      .981      .981     .975          .974         .995    .991

Note. Ten raters per essay. d is the SDT discrimination parameter; PC_pred is the predicted
proportion correct; PC_obt is the obtained proportion correct.

^a The d = 5 condition includes only the 48 of 100 replications where the program converged.


Table 1 also includes conditions for ds of 2 and 3 where the response criteria were shifted 
from the intersection points for 8 out of 10 raters (up or down by 1 or 2, see Appendix A). It is 
interesting to compare classification accuracy in these conditions to that for the intersection point 
criteria conditions. Table 1 shows that, for a d of 2 and 3, shifting the criteria has little effect on 
PC, either predicted or obtained, with a reduction in PC for shifted criteria of less than 1%. This
shows that, for a fully crossed design, the criteria locations have little effect on classification
accuracy. 

Agreement. Table 2 shows the relation between d and agreement statistics. The agreement 
proportions and weighted kappas are the average, for each replication, across the 45 pairs of rater 
combinations. That is, they are the average pairwise agreement, which in turn is then averaged 
over the 100 replications. For weighted kappa, Cicchetti-Allison weights were used, as 
documented in the FREQ procedure of SAS. A value of Kendall’s coefficient of concordance W 
(Kendall & Smith, 1939), based on all 10 raters, was also computed for each replication and then 
averaged over the 100 replications. In contrast to the simple agreement statistic and kappa,
which only consider pairwise relations, W is a measure of agreement across all of the raters (W
examines agreement in rankings across the raters, where the rankings are obtained within each 
rater by using their scores); the fact that it takes into account that there are 10 raters is likely 
why, as shown next, W tends to be larger than the agreement statistic or kappa. 
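
The pairwise statistics can be sketched in Python as follows. The sketch computes exact agreement and weighted kappa with linear (Cicchetti-Allison) agreement weights, 1 − |i − j|/(K − 1), and averages each over all pairs of raters. It is a plain re-expression of the standard definitions rather than the SAS FREQ procedure itself, Kendall's W is omitted for brevity, and the function names and the small example data are assumptions.

```python
from itertools import combinations

def exact_agreement(a, b):
    """Proportion of cases on which two raters give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def weighted_kappa(a, b, n_categories):
    """Weighted kappa with linear (Cicchetti-Allison) agreement weights."""
    n = len(a)
    K = n_categories

    def weight(i, j):
        return 1 - abs(i - j) / (K - 1)

    # Marginal proportions for each rater over the scores 1..K.
    pa = [sum(x == k for x in a) / n for k in range(1, K + 1)]
    pb = [sum(y == k for y in b) / n for k in range(1, K + 1)]
    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    expected = sum(weight(i, j) * pa[i - 1] * pb[j - 1]
                   for i in range(1, K + 1) for j in range(1, K + 1))
    return (observed - expected) / (1 - expected)

def average_pairwise(ratings_by_rater, stat):
    """Average a two-rater statistic over all pairs of raters."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(stat(a, b) for a, b in pairs) / len(pairs)

# Made-up scores (1-4) from three raters on six essays.
ratings_by_rater = [[1, 2, 3, 4, 2, 3],
                    [1, 2, 3, 4, 3, 3],
                    [2, 2, 3, 4, 2, 4]]
mean_agreement = average_pairwise(ratings_by_rater, exact_agreement)
mean_kappa = average_pairwise(ratings_by_rater,
                              lambda a, b: weighted_kappa(a, b, 4))
print(round(mean_agreement, 3), round(mean_kappa, 3))
```

With 10 raters, the same averaging runs over the 45 rater pairs mentioned above.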

The upper part of Table 2 is for the conditions with intersection-point criteria locations; 
the table shows that agreement increases from .27 to .75 as d varies from 1-5; weighted kappa 
ranges from .28 to .84 and Kendall’s W ranges from .47 to .90. In contrast, Table 1 shows that 
PC is greater than .90 for values of d from 2-5. For example, for d = 2, agreement is .37 (from
Table 2), whereas the proportion correctly classified is .92 (from Table 1). Thus, agreement can 
be low while classification accuracy is high. 


Table 2

Average Pairwise Agreement and Weighted Kappa

                                      d
Criteria               1       2       3       4       5
Intersection-point criteria
  Agreement            .267    .366    .502    .636    .752
  Weighted kappa       .283    .492    .645    .754    .836
  Kendall’s W          .469    .111    .838    .899    .936
Shifted criteria
  Agreement            -       .290    .380    .482    -
  Weighted kappa       -       .388    .533    .638    -
  Kendall’s W          -       .711    .829    .881    -

The lower portion of Table 2, for three conditions with shifted criteria, shows that
agreement again increases with d. Most important, Table 2 shows that the percent agreement and 
kappa are smaller for the shifted criteria as compared to the intersection-point criteria. For 
example, for a d of 3, agreement decreases by 12 percentage points, that is, from 50% for intersection-point
criteria to 38% for shifted criteria, which shows that agreement is heavily affected by the criteria 
locations, as is weighted kappa (an interesting result is that the shifted criteria only appear to 
have a small effect, less than .02, on Kendall’s W). In contrast, Table 1 shows that, for a d of 3 
with shifted criteria, PC decreases by less than one half of a percent. This shows that shifting the 
response criteria has a large effect on agreement but only a small effect on classification 
accuracy, which suggests limitations of using agreement in practice. 

Discussion 

The simulations show that, for a fully crossed design with 10 raters, the rater parameters 
are accurately recovered for a range of discrimination from 2-4, with d being accurately 
estimated for a range of 2-5. The response criteria and latent-class sizes are well recovered for ds 
from 2-4 but are less well recovered for values outside of this range. Of course, these results 
depend in part on the software being used and the particular set of parameters. For example, 
Latent Gold 4.5 accurately recovered the rater parameters for a range of d from 2-4, but there 
were problems even within this range with LEM (Vermunt, 1997), mostly a problem of obtaining 
latent-class size estimates of zero. Latent Gold has options, such as Bayes constants, that can 
help to ameliorate problems of this sort; this will be examined in future studies (also see the 
section on incomplete designs below). 

An important result is that classification accuracy increases with the raters’ level of 
discrimination, as does agreement. Thus, classification accuracy can be increased by improving 
raters’ discrimination. The levels of agreement, however, are not very informative about 
classification accuracy, as they are not meant to be. For example, for a d of 3 in a fully crossed 
design, agreement is 50% and weighted kappa is 64%, which are low to moderate, whereas 
classification accuracy is 98%, which is excellent. Thus, classification accuracy is high, yet 
average pairwise agreement is poor to moderate. 

Another important result shown in Table 2 is that agreement is heavily affected by the 
response criteria locations; for example, agreement dropped by over 10% when the criteria were 
shifted for some of the raters, whereas the effect on classification accuracy was negligible (less
than 1%). These results support the view that the discrimination parameter d is more informative
in terms of evaluating the raters’ performance and the resulting classification accuracy. For 
example, the simulation shows that, with 10 raters in a fully crossed design, classification 
accuracy is high for values of d of 2 or more. The results also show the advantage of using 
model-based classifications over simply averaging the scores, in that rater differences in 
response criteria have little effect on the model-based classifications, as shown in Table 1 (more 
on this below). 


Simulated Data: Balanced Incomplete Block (BIB) Design 

The above simulations offer basic information about parameter recovery and 
classification in fully crossed designs. However, in practice, the designs used are incomplete, in 
that not all the raters rate all of the essays. This section examines the performance of the
latent-class SDT model in situations with incomplete data. A BIB design, which is very efficient, is
used. This provides information about the effects of incompleteness in a best case scenario and 
provides an important reference point for future studies using other types of incomplete designs, 
such as unbalanced designs. 

Methods 

The data generated for the fully crossed design were used. A SAS macro was used on the 
data to create missing values according to a BIB design for 10 raters and 1,080 cases, with each
essay scored by 2 raters. The incomplete aspect of the design is that each essay is scored by only 2
out of 10 raters, whereas the balanced aspect is that (a) each essay is scored by 2 raters, (b) each
rater scores 216 essays, and (c) each rater is paired with every other rater an equal number of 
times. Each condition consisted of 100 replications. 
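
The balance properties listed above can be checked with a small sketch. This is not the SAS macro used in the report; the function name `bib_assignment` and the simple cycling rule are assumptions made for illustration. With 10 raters there are 45 distinct rater pairs, so 1,080 essays allow each pair to occur 24 times.

```python
from collections import Counter
from itertools import combinations


def bib_assignment(n_raters=10, n_essays=1080):
    """Assign one rater pair to each essay so that all pairs occur equally often."""
    pairs = list(combinations(range(1, n_raters + 1), 2))  # 45 pairs for 10 raters
    if n_essays % len(pairs) != 0:
        raise ValueError("n_essays must be a multiple of the number of rater pairs")
    # Hypothetical rule: cycle through the pairs in order.
    return [pairs[i % len(pairs)] for i in range(n_essays)]


design = bib_assignment()
essays_per_rater = Counter(rater for pair in design for rater in pair)
essays_per_pair = Counter(design)

print(sorted(set(essays_per_rater.values())))  # [216]
print(sorted(set(essays_per_pair.values())))   # [24]
```

In an actual study the essays would presumably be assigned to pairs at random rather than in a fixed cycle; the counts per rater and per pair are unaffected by that choice.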

Data that are missing by design, as in the BIB, are missing completely at random (Rubin, 
1976). Latent Gold 4.5 was again used to fit the latent-class SDT model to the incomplete data, 
and SAS macros were used to strip out and summarize the data. Some pilot simulations showed 
that estimation problems occurred, and in particular latent-class sizes of zero or large values of d 
(with large or indeterminate standard errors) were found. Using Bayes constants of one (for the 
latent and categorical options; see Vermunt & Magidson, 2007) appeared to eliminate these 
problems (use of Bayes constants smooths the parameter estimates and helps to prevent 
boundary problems). Thus, they were used for the incomplete simulations presented next; note 
that, with the use of Bayes constants, one is using posterior-mode estimation, which includes log 
priors in the log likelihood function. The priors act as a penalty for solutions that are too close to 
the boundary of the parameter space, with the result that the parameter estimates are smoothed 
away from the boundary, as noted by Vermunt and Magidson (2005). 
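
In generic form, posterior-mode estimation can be written as follows. The notation is assumed here for illustration and is not taken from the report: θ collects the rater parameters and latent-class sizes, L(θ) is the likelihood, and p(θ) is the prior implied by the Bayes constants.

```latex
\hat{\theta}_{\mathrm{PM}}
  = \arg\max_{\theta}\,\bigl[\log L(\theta) + \log p(\theta)\bigr]
```

The term log p(θ) plays the role of the penalty described above; with p(θ) constant, the expression reduces to ordinary maximum likelihood.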

Results 

Rater parameters and latent-class sizes. Appendix C presents, for the BIB conditions, the 
mean estimated parameters, the bias, the percent bias, and the MSE. Tables C1-C3 show that, for 
values of d from 2-4, the percent bias is generally 10% or less for d, but ranges to over 60% for 
c; similarly, the MSE is generally less than 0.3 for d but is larger for c. Some patterns also appear 
across the tables—the percent bias tends to be largest for the first criterion, whereas the MSE 
tends to be largest for the last criterion. Tables C1-C3 also show that the bias for the latent-class 
sizes is generally less than 10% for latent classes of 2-5 (at least for ds of 2 and 3) but is large 
for the first and last classes (which have the smallest sizes), with a percent bias of up to 80%. 
Table C4 shows a condition with shifted criteria for a d of 3; the percent bias for d is again 
generally 10% or less. However, the percent bias for the response criteria and latent-class sizes 
tends to be larger, and so shifting the criteria led to somewhat larger bias in the criteria and size 
estimates. 

Compared to the fully crossed design, the bias and MSE are larger for the BIB design, as 
expected, because of the large number of missing values. Overall, however, Appendix C shows 
that estimation of d is good for values of d from 2-4. Estimation of the latent-class sizes is also 
adequate; however, the smallest latent-class sizes tend to be overestimated. 
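
The recovery measures reported in the appendices can be sketched as below. The function name `recovery_summary` is hypothetical, and the percent bias is assumed to be reported as an absolute value, which is consistent with the Appendix A entries (for example, a bias of -0.008 on a true value of 2 is listed with a percent bias of 0.400).

```python
def recovery_summary(estimates, true_value):
    """Summarize parameter recovery over replications for one parameter."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    # Assumption: percent bias is the absolute bias relative to the true value.
    percent_bias = 100.0 * abs(bias) / abs(true_value)
    # MSE is the mean squared deviation of the estimates from the true value.
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return {
        "estimate": mean_estimate,
        "bias": bias,
        "percent_bias": percent_bias,
        "mse": mse,
    }


# Made-up estimates of d from four replications with a true value of 2.
summary = recovery_summary([1.9, 2.0, 2.1, 2.2], true_value=2.0)
print(summary)
```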

Standard errors. Appendix D presents, for the BIB conditions, tables that examine the 
performance of the standard errors for d and the latent class sizes. The bias is generally small to 
moderate, 10-15% or less, which indicates that the standard errors are reasonably well estimated. 
A comparison of Appendix D to Appendix B shows that the standard errors are larger for the 
BIB conditions than for the fully crossed conditions; for example, the standard errors are about 
0.10 for a d of 3 in a fully crossed design but are about 0.45 for a d of 3 in a BIB design. This 
reflects the fact that less information is available in the BIB design with two raters per essay, as 
compared to a fully crossed design. 

Classification. Table 3 presents the proportion correctly classified for the BIB design. As 
for the fully crossed design, the proportion correctly classified increases as d varies from 2-4, 
with the obtained PC ranging from about 52-80%; the Pearson correlation and tau-b also increase, 
from about .80 to over .90. The condition with shifted criteria shows that the criteria only have a 
small effect on the PC and association measures. For example, for a d of 3, PCobt is 69.6% for 
intersection-point criteria and 68.2% for shifted criteria, which is a difference of less than 2%. 
Thus, shifting the criteria has little effect on classification accuracy (for model-based 
classifications, see below). Similarly, tau-b and r are both around .90 and differ across 
intersection-point and shifted criteria by only about .01. Table 3 also shows that PCpred 
overestimates PCobt, as found above for the fully crossed design and by DeCarlo (2005); for 
example, for a d of 3, PCpred is .74, whereas PCobt is .70. 

Table 3 

Proportion Correctly Classified and Correlations With True Latent Classes, Balanced 
Incomplete Block Design 

d    PCpred   PCobt   PCav    λpred   λobt    tau-b   r
Intersection-point criteria
2    .623     .525    .575    .478    .360    .792    .866
3    .744     .699    .708    .656    .594    .871    .926
4    .843     .799    .803    .788    .729    .911    .951
Shifted criteria
3    .721     .682    .617    .627    .572    .871    .925
4    .812     .801    .707    .747    .731    .912    .950

Note. Two raters per essay. 


A comparison of Table 3 and Table 1 shows the effects on classification of using a BIB 
design over a fully crossed design. For example, for a d of 3, PCobt is about 70% for the BIB with 
2 raters per essay and is 98% for the fully crossed design with 10 raters per essay. Thus, there is 
a large effect of the number of raters per essay on classification accuracy, as expected. Table 3 
also shows the proportion correctly classified using the average of the two scores, PCav. This is 
of interest because the simple average is commonly used in practice. To assess classification 
accuracy for cases where the average had in-between values, such as 2.5, the scores were 
rounded both up and down; the results differed by less than .003, and the larger values of PCav 
are reported here. Table 3 shows that, as before, classification accuracy increases with d. 
However, in contrast to the model-based classifications, PCav is considerably lower for the 
shifted criteria. For example, for a d of 3, PCav was 70.8% for intersection-point criteria locations 
but only 61.7% for shifted criteria locations, a decrease of almost 10%. Thus, the results for PCav 
show that the proportion correctly classified drops considerably for average scores if there are 
differences in the response criteria locations across raters, as in the shifted criteria condition. In 
contrast, criteria shifts have little effect on model-based classifications (PCobt). Thus, an 
advantage of the model-based approach is that classification accuracy is not affected by 
idiosyncrasies in raters’ response usage, whereas it is affected if average scores are used. 
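
The average-score classification described above can be explored with a small simulation sketch. Several things here are assumptions rather than details from the report: the model is taken to be P(Y <= k | class) = logistic(c_k - d * class) with classes coded 0 to K-1, and the class sizes, the size of the criteria shift, and the function names are hypothetical. The sketch covers only the rounded average of two scores; model-based classification would require fitting the latent-class model.

```python
import math
import random


def sample_rating(d, criteria, eta, rng):
    """Draw one rating from the assumed cumulative logistic model."""
    u = rng.random()
    for k, c in enumerate(criteria, start=1):
        if u < 1.0 / (1.0 + math.exp(-(c - d * eta))):
            return k
    return len(criteria) + 1


def pc_average(d, criteria_by_rater, class_sizes, n_essays, seed=0):
    """Proportion correctly classified by the average score, rounded down and up."""
    rng = random.Random(seed)
    classes = range(len(class_sizes))
    hits_down = hits_up = 0
    for _ in range(n_essays):
        eta = rng.choices(classes, weights=class_sizes)[0]
        scores = [sample_rating(d, c, eta, rng) for c in criteria_by_rater]
        mean = sum(scores) / len(scores)
        hits_down += math.floor(mean) == eta + 1  # class eta corresponds to score eta + 1
        hits_up += math.ceil(mean) == eta + 1
    return hits_down / n_essays, hits_up / n_essays


K, d = 6, 3.0
sizes = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]    # hypothetical class sizes
base = [d * (k - 0.5) for k in range(1, K)]     # intersection-point criteria
shifted = [c + 1.0 for c in base]               # hypothetical shift for one rater

pc_same = pc_average(d, [base, base], sizes, n_essays=5000)
pc_shift = pc_average(d, [base, shifted], sizes, n_essays=5000)
print(pc_same, pc_shift)
```

Comparing the two printed pairs gives a rough sense of how a shift in one rater's criteria changes the accuracy of average-score classification under these assumed values.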

Agreement. Table 4 shows two agreement statistics, the proportion of exact agreement and 
weighted kappa, for the BIB simulation (note that Kendall’s W cannot be computed for the BIB 
design because of the missing values). Table 4 shows agreement between pairs of raters averaged 
over the 100 replications. A comparison of Table 4 to Table 2 shows only trivial differences, 
which is as expected. The only difference is that the fully crossed design first averages over the 
45 rater pairs for each essay, and then over the 100 replications, whereas for the BIB design, 
there is only 1 rater pair per essay, and so the averaging is only over the 100 replications. Table 4 
shows that agreement increases with the discrimination parameter d and that agreement is 
heavily affected by the response criteria, decreasing, for example, by 12% (50% to 38%) for a d 
of 3 and by 16% (64% to 48%) for a d of 4; the weighted kappas are also smaller. 
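
For reference, the two agreement statistics can be computed for one pair of raters as sketched below. The report does not state which weights were used for weighted kappa, so the quadratic disagreement weights here (with linear weights available via `power=1`) are an assumption, as is the function name.

```python
def agreement_and_weighted_kappa(ratings1, ratings2, n_categories, power=2):
    """Return (exact agreement, weighted kappa) for two raters scoring 1..n_categories."""
    n = len(ratings1)
    K = n_categories
    # Observed joint proportions of the two raters' scores.
    obs = [[0.0] * K for _ in range(K)]
    for a, b in zip(ratings1, ratings2):
        obs[a - 1][b - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(K)]
    col = [sum(obs[i][j] for i in range(K)) for j in range(K)]
    exact = sum(obs[i][i] for i in range(K))
    # Disagreement weights |i - j| ** power (assumed: quadratic when power=2).
    observed = sum(abs(i - j) ** power * obs[i][j] for i in range(K) for j in range(K))
    expected = sum(abs(i - j) ** power * row[i] * col[j] for i in range(K) for j in range(K))
    if expected == 0.0:
        raise ValueError("weighted kappa is undefined when both raters use a single category")
    return exact, 1.0 - observed / expected


# Made-up scores from two raters on six essays.
exact, kappa_w = agreement_and_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 5], 6)
print(exact, kappa_w)
```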

Table 4 

Agreement Proportions and Weighted Kappa 

                                  d
Criteria            1      2      3      4      5
Intersection-point criteria
  Agreement         .267   .368   .505   .636   .751
  Weighted kappa    .268   .488   .643   .748   .832
Shifted criteria
  Agreement                .292   .381   .483   —
  Weighted kappa           .422   .563   .655   —

Table 4 suggests that agreement statistics are informative about pairwise agreement, as 
they should be, but not about classification accuracy. For example, for a d of 3, Table 4 shows 
that agreement is only about 50% for intersection-point criteria and 38% for shifted criteria. 
However, Table 3 shows that classification accuracy is about 70% in both cases, and the 
measures of association are about .90. Thus, low agreement does not necessarily mean that the 
raters are performing poorly; the discrimination parameter is more informative in this regard, in 
that it provides information about how well the raters discriminate the latent classes and about 
how accurately the items are classified. 

Discussion 

The simulations provide basic information about various aspects of an approach to essay 
grading via SDT. First, the simulations show that estimation of the rater parameters, particularly 
the discrimination parameter, is good for both complete and incomplete designs (at least with the 
use of Bayes constants for incomplete designs), for the range of values examined here (d from 2-5), 
which are comparable to those found in practice (see below). It should be noted that there 
tend to be estimation problems in incomplete designs, but the use of Bayes constants (of one) 
and posterior-mode estimation gave good results. Estimation of the latent-class sizes also appears 
to be adequate across values of d from 2-4 (again with the use of Bayes constants for incomplete 
designs), at least for the normal-like distribution of latent-class sizes examined here (as found for 
real-world data below). Larger values of d, such as 5, can lead to convergence problems and poor 
estimation of the response criteria and latent-class sizes. Classification is also adversely affected; 
d, however, still appears to be adequately estimated, which means rater performance can still be 
evaluated. An argument can be made for the use of average ratings in the situation where 
estimation problems arise because of large values of d, in that classification accuracy should be 
high (for average ratings) even with differences in the criteria locations; the use of larger values 
of Bayes constants in that situation also can be explored. 

Tables 1-4 provide useful information about expected performance in a signal-detection 
task as a function of rater discrimination. For a fully crossed design with 10 raters, Table 1 
shows that classification accuracy is excellent (>90%) even for the lowest value of 
discrimination examined; this occurs because there are many raters per essay. Measures of 
agreement, such as the percent agreement and weighted kappa, tend to be considerably smaller. 
Of greater practical interest is that, for a BIB design with 2 raters per essay, Table 3 shows that 
70% or more of the essays are correctly classified for values of d of 3 or larger, whereas 
agreement is 50% or more and weighted kappa is 64%. Table 3 provides guidelines as to the 
levels of classification accuracy and agreement associated with a particular level of rater 
performance. 

In light of the above, requiring a specified level of agreement is shown to be a 
conservative approach. For example, suppose a minimum of 70% agreement is required. Table 4 
shows that this is associated with a level of discrimination of greater than 4, which is quite good, 
whereas Table 3 suggests that a d of over 4 is associated with an obtained classification accuracy 
of over 80%. Thus, a specified level of agreement is a strict criterion, which is fine as long as this 
is understood. In some situations, it might be more useful simply to consider expected 
classification rates rather than agreement levels. For example, if classification accuracy is desired 
to be 70% or greater, then requiring (an average) rater discrimination of 3 or larger (for the 
logistic model) seems quite reasonable. The above also shows that agreement is heavily affected 
by the response criteria locations, as expected, and so agreement can be misleading with respect 
to how good classification is. Estimates of d and c, on the other hand, provide important 
information about classification accuracy and whether raters are performing adequately or not. 

ETS Data 

This section applies the latent-class signal-detection model to the writing section of 
several ETS datasets. This application provides information about parameter values found in 
practice. 

Example 1: Writing Assessment 

The data examined here are scores given to essays written by 10,647 examinees as part of 
a large-scale writing assessment (note that 17 essays were dropped because of one or more 
missing scores and 4 more were dropped because 2 raters scored only 2 essays each). The essays 
were scored by 44 raters, who used a 1-6 response scale (a response of zero is also possible but 
was not used for the subset of essays examined here). Each essay was scored by 2 raters, with the 
44 raters each scoring anywhere from 33-1404 essays. 

Differences in response category usage were noted across the 44 raters. For example, 9 
raters used all of the response categories, Categories 1-6, whereas 27 raters used Categories 2-6; 
5 raters used Categories 2-5; and 1 rater each used Categories 1-5, 3-6, or 3-5. In some cases, 
the restricted response range likely occurred because of a small sample size, for example, the 
rater who only used Categories 3-5 scored just 33 essays, and the rater who used Categories 3-6 
scored 137 essays. Yet, this was not always the case—the rater who used Categories 1-5 scored 
897 essays. From the signal-detection perspective, the differences in response category usage 
reflect individual differences in the response criteria locations. Lack of response category usage 
has been discussed in the measurement literature as an issue of null categories (see Wilson & 
Masters, 1993); for the analysis presented here, the response categories were downcoded (i.e., 
2-6 becomes 1-5), which has no effect on the estimates of d (and c2 becomes c1, etc.). The 
effect, if any, of downcoding on classification accuracy in the context of the latent-class SDT 
model is being examined in current research. 

The typical approach to arrive at a score for each essay is simply to add or average the 
two scores. This approach essentially treats the pool of raters for the first and second scores as 
being equivalent (for each score); that is, the data are pooled across raters and so are treated as 
being from a fully crossed design (i.e., there are two scores, collapsed across raters, for all 
essays). On the other hand, fitting the latent-class SDT model to the data in incomplete form, 
where dj and cjk are treated as rater-specific fixed effects, allows examination of any differences 
across the 44 raters who actually provided the scores. 

Figure 3 shows a histogram of the estimates of dj for the 44 raters, obtained by fitting the 
model to the data in incomplete form (again using Bayes constants of one). The estimates of d 
have a mean of 3.5 with a range of 1.9-5.4 and a standard deviation of 0.9. The estimates are 
approximately normally distributed, with a (Fisher’s g) skew of 0.05 (SE of 0.36) and kurtosis of 
-0.43 (SE of 0.70). Thus, there appear to be differences in discrimination across the 44 raters. 

Figure 4 presents a plot of the relative criteria (DeCarlo, 2005) for the 44 raters who 
scored the test. The relative criteria are 


\[ \mathrm{rel}\; c_{jk} = \frac{c_{jk}}{(K-1)\, d_j}, \qquad (5) \]

where K is the number of latent classes and the estimates obtained for cjk and dj are used in the 
above. Equation 5 equalizes the location of the highest and lowest distributions across raters; for 
example, the lowest distribution is set at 0 and the highest at 1 in Figure 4. The horizontal lines 
show the intersection-point locations of the five response criteria, that is, the crossover points of 
the symmetric underlying distributions. Thus, Figure 4 compactly shows the locations of each 
rater’s response criteria, relative to the intersection points of the six underlying distributions, 
which is informative about the raters’ use of the response categories. For example, for Rater 1, 
Figure 4 shows that the first two criteria are below the intersection point of the first and second 
distribution (and so Rater 1 is somewhat conservative with respect to giving responses of 
Categories 1 and 2), whereas the third, fourth, and fifth criteria are at the second, fourth, and fifth 
intersection points. 
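
Equation 5 and the horizontal reference lines in Figure 4 can be illustrated with a short sketch. The function names are hypothetical, and the example value of d is chosen only to match the mean estimate reported for this test. The intersection-point criteria are taken to lie at d(k - 1/2), an assumption that matches the criteria values used in the simulations.

```python
def intersection_criteria(d, n_classes):
    """Criteria at the crossover points of adjacent distributions: d * (k - 1/2)."""
    return [d * (k - 0.5) for k in range(1, n_classes)]


def relative_criteria(criteria, d, n_classes):
    """Equation 5: rescale so the lowest distribution sits at 0 and the highest at 1."""
    return [c / ((n_classes - 1) * d) for c in criteria]


K = 6
d = 3.5  # illustrative value only
rel = relative_criteria(intersection_criteria(d, K), d, K)
print([round(r, 3) for r in rel])  # [0.1, 0.3, 0.5, 0.7, 0.9]
```

Under this rescaling the intersection points do not depend on d, which is why a single set of horizontal lines can serve as the reference for all raters.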



Figure 3. Distribution of d for the writing test. 

Overall, Figure 4 shows differences between raters in response category usage, both in 
terms of criteria locations and the number of categories used (as noted above). An interesting 
result that is apparent in the figure is that the raters tend to be conservative with respect to their 
use of the lower response categories, in that the first and sometimes second criteria tend to be 
well below the intersection-point locations (keeping in mind that in situations with four instead 
of five criteria, the first response was usually 2, and so the first criterion shown is actually c2, not 
c1). Figure 4 also shows that, in general, there are criteria that tend to lie on the second and fourth 
intersection points, whereas the first and last criteria tend to be below and above the first and last 
intersection points, respectively. Thus, Figure 4 shows that, according to the SDT model, raters 
appear to be conservative with respect to using categories such as 1 and 6, in that the 
corresponding criteria tend to be well above or below the intersection point (which means that 
those responses are used less frequently). 

Figure 4. Relative criteria locations for 44 raters. 

The estimates of the latent-class sizes (with standard errors in parentheses) are .02 (.002), 
.14 (.01), .19 (.01), .46 (.02), .17 (.01), and .02 (.006). Thus, the sizes of Latent Classes 1 and 6 are 
small, and Latent Class 4 is the largest. Note that although Latent Classes 1 and 6 have 
small sizes, the standard errors are small (because of the large sample size). Also note that the 
latent classes are approximately normally distributed, with some small, negative skew. The 
predicted proportion correctly classified, PCpred, is .74, which is consistent with values found in 
the simulation for ds of 3-4. For example, PCobt in Table 3 suggests that 70-80% of the cases 
might be classified correctly. A simulation using the obtained parameter estimates can be 
conducted to gain more detailed information about likely classification accuracy. 
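
A predicted proportion correctly classified can be computed from a set of parameter values by enumerating the response patterns, as sketched below. The sketch rests on assumptions not stated in this section: the model is taken to be P(Y <= k | class) = logistic(c_k - d * class) with classes coded 0 to K-1, each essay is assigned to its posterior-modal class, and both raters are given the same d with intersection-point criteria. Because the 44 raters actually differ in d and in their criteria, the result is illustrative and is not expected to reproduce the reported value.

```python
import math
from itertools import product


def response_probs(d, criteria, eta):
    """Category probabilities for one rater, given latent class eta."""
    cumulative = [1.0 / (1.0 + math.exp(-(c - d * eta))) for c in criteria] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def predicted_pc(class_sizes, raters):
    """Sum over response patterns of the largest joint probability across classes."""
    K = len(class_sizes)
    tables = [[response_probs(d, c, eta) for eta in range(K)] for d, c in raters]
    n_categories = len(tables[0][0])
    total = 0.0
    for pattern in product(range(n_categories), repeat=len(raters)):
        joint = [
            class_sizes[eta] * math.prod(tables[j][eta][y] for j, y in enumerate(pattern))
            for eta in range(K)
        ]
        total += max(joint)  # the posterior-modal class is the one with the largest joint
    return total


sizes = [0.02, 0.14, 0.19, 0.46, 0.17, 0.02]  # latent-class size estimates from the text
d = 3.5                                        # mean estimate of d from the text
criteria = [d * (k - 0.5) for k in range(1, 6)]  # assumed intersection-point criteria
pc = predicted_pc(sizes, [(d, criteria), (d, criteria)])
print(round(pc, 3))
```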

In sum, an application of latent-class SDT to a writing assessment offers new and 
interesting results. First, it suggests that rater performance for the test is very good (average d of 
3.5), with some differences across raters in discrimination and response criteria. This, along with 
PCpred, suggests that classification accuracy is likely 70% or more. Second, the estimates of the 
response criteria suggest that the raters are conservative with respect to use of the lowest 
response categories and the highest category; this has never, to my knowledge, been noted before 
and merits further attention. It could occur, for example, because of the scoring rubric or other 
instructions given to the raters (e.g., that might lead them to believe that the lowest and highest 
categories occur less frequently than they actually do). It is also interesting to note that this was 
not found for the analysis of data from the next test examined. Third, the estimated latent-class 
sizes suggest that most of the essays are classified into Category 4, with only about 2% classified 
into Categories 1 and 6; the distribution of the latent classes is also close to that for a normal 
distribution (with a small negative skew). 

Example 2: Writing Assessment 

For the second large-scale writing test examined, the scoring rubric consisted of 
categories from 1-5 (there is also a 0 category for essays that have, for example, little or no text 
or are not on the assigned topic; in this example, only 124 essays out of over 42,000 received 
scores of zero and so were not included in the analysis). This section presents an analysis of data 
from 42,608 examinees (after dropping 69 cases because of missing values and 124 more cases 
with zeroes), with 2 essays per examinee. Each essay had two scores (from various raters; the 
pooled data are analyzed here), with a third score given by an adjudicator when the two scores 
differed by 2 or more. For the first writing task, 3.9% of the essays had third (adjudicated) 
scores, whereas 2.6% of the essays for the second task had third scores. Data for the third scores 
can be viewed as being missing at random (Rubin, 1976), in that the probability that a value is 
missing is determined by an observed variable, the difference between the two observed scores 
(i.e., the third score is missing if the difference is less than 2). The analyses presented here include 
the third scores. DeCarlo and Kim (2008) showed that estimation is good for adjudicated scores as 
long as a sufficient number are available, as is typically the case for large-scale assessments. Latent-class 
SDT models with five latent classes are fit to the data. In this analysis, the data are treated as 
coming from a fully crossed design (i.e., the data are pooled across raters) in order to obtain 
information about results for pooled data, as are commonly analyzed. Also, the data are being 
used in that form in current research on a hierarchical rater model. 
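
The adjudication rule that produces the missing third scores can be written as a small sketch. The function names and the example score pairs are made up for illustration; only the threshold of 2 comes from the text.

```python
def needs_adjudication(score1, score2, threshold=2):
    """A third score is obtained only when the first two scores differ by 2 or more."""
    return abs(score1 - score2) >= threshold


def third_score_rate(score_pairs, threshold=2):
    """Proportion of essays for which a third (adjudicated) score is observed."""
    flagged = sum(needs_adjudication(a, b, threshold) for a, b in score_pairs)
    return flagged / len(score_pairs)


pairs = [(3, 3), (2, 4), (4, 5), (1, 5)]  # made-up score pairs
rate = third_score_rate(pairs)
print(rate)  # 0.5
```

Whether a third score is missing depends only on the two observed scores, which is the sense in which the text treats these data as missing at random.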

Table 5 presents results for the first writing task. A latent-class, logistic, SDT model was fit to 
the data using all three scores, where the third score was only available for the adjudicated cases 
(3.9%). Table 5 shows that the estimates of d are similar across the three scores, being 3.8, 3.9, 
and 3.4, respectively. Note that the standard errors are larger for Score 3 because 96.1% of the 
scores were missing (there were still 1,661 third scores available). It is interesting to note that, 
despite the high degree of missing scores, the estimates of cjk and dj for the third score are close 
to those obtained for the other two scores, which indicates that the behavior of the adjudicators 
was similar to that of the other raters. 


Table 5 

Results for the Second Writing Test Treated as a Fully Crossed Design 

                 Score 1             Score 2             Score 3
Parameter    Estimate   SE       Estimate   SE       Estimate   SE
d              3.77     0.05       3.88     0.06       3.39     0.20
c1             1.73     0.07       1.84     0.07       1.26     0.29
c2             5.47     0.11       5.59     0.12       4.43     0.33
c3             8.98     0.12       9.22     0.14       7.86     0.45
c4            12.35     0.16      12.67     0.17      11.03     0.59


Table 5 shows that discrimination is again in the range of 3-4 (keeping in mind that five 
latent classes were used for this example, whereas six were used for the first example). It is also 
apparent that discrimination for the adjudicated score is about the same in magnitude as for the 
other two scores. The criteria estimates are also similar in magnitude across the three scores, 
with the criteria for Score 3 being slightly to the left of the other two scores, which indicates that 
the criteria for the third score were slightly more liberal (i.e., higher responses were used) than 
those for the other two scores. It is also interesting to note that, in all cases, the response criteria 
estimates in Table 5 are close to their intersection-point locations. For example, using the 
parameter estimate for d for the first score (3.8), intersection-point criteria locations will be at 
1.9, 5.7, 9.5, and 13.3, which are close in value to the estimates shown in Table 5 (1.7, 5.5, 9.0, 
and 12.3). Thus, it appears that, for pooled data, the response criteria tend to be located close to 
the intersection points of the underlying logistic distributions, which to my knowledge has not 
been noted before. 
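
The intersection-point values quoted above follow from a simple rule, shown here as a worked calculation. It assumes that the underlying distributions have equal spread and are located at 0, d, 2d, and so on, so that adjacent distributions cross midway between their locations.

```latex
c_k = d\left(k - \tfrac{1}{2}\right), \qquad k = 1, \dots, K - 1 .
% With d = 3.8 and K = 5:
c_1 = 3.8 \times 0.5 = 1.9, \quad
c_2 = 3.8 \times 1.5 = 5.7, \quad
c_3 = 3.8 \times 2.5 = 9.5, \quad
c_4 = 3.8 \times 3.5 = 13.3 .
```

The Score 1 estimates in Table 5 (1.73, 5.47, 8.98, and 12.35) each fall somewhat below the corresponding intersection point, with the gap growing from about 0.2 for the first criterion to about 1.0 for the fourth.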

With respect to the latent-class sizes (Table 6), the end categories are smallest, being .12 
and .08 for Categories 1 and 5, respectively, and are larger than that found for the first test 
examined above. The largest latent-class size is for Category 3, followed by Category 4; note that 
the latent-class sizes are also approximately normally distributed. The estimate of PCpred is .84. 
Just considering Scores 1 and 2, the agreement proportion is .58 and weighted kappa is .67. 

Table 6 

Latent Class Sizes for the Second Writing Test 

Category    p1      p2      p3      p4      p5
Estimate    .12     .17     .34     .29     .08
SE          <.01    <.01    <.01    <.01    <.01


In sum, the results for the writing sections of several ETS tests showed consistent results 
with respect to rater parameters—discrimination appears to be in the range of 3-4 for a logistic 
SDT model (with five or six latent classes). Note that the SDT approach focuses attention on 
effect sizes for the raters, namely the magnitude of discrimination as measured by d, which is 
informative about rater performance and classification accuracy. The response criteria for raters 
in the second test appeared to be close to their intersection-point locations. The results also 
showed that the distribution of latent classes is slightly asymmetric (negatively skewed), but 
close to (discrete) normal. 


Summary and Conclusions 

The present report lays out the scope and potential of an approach to essay grading via a 
latent-class extension of SDT. The simulations provide basic information about parameter 
estimation and about the relation between discrimination, classification, and agreement; the real- 
world analyses provide information about values of the rater parameters and latent-class sizes 
that are found in practice. 

The approach via SDT also informs several issues. For example, why is agreement of 
interest in scoring tasks such as essay grading? The answer is that high agreement suggests that 
the raters are detecting a construct, such as the latent classes defined in the scoring rubric. 
However, as shown here, agreement is at most only an indirect indicator of rater performance, in 
that it depends not only on the raters’ ability to discriminate between the latent classes, but also 
on their use of response criteria. For example, as shown here, the raters’ discrimination can be 
quite high and classification accuracy can be high, yet agreement can be quite low (e.g., if the 
criteria differ across raters). Thus, agreement only provides an indirect assessment of what is 
really of interest, which is how well the raters classify the essays. Here it is noted that estimates 
of the raters’ parameters, particularly d, are informative about rater performance and 
classification accuracy. The recommendation is to supplement agreement statistics with 
estimates of the rater parameters d and c; at the least, this might provide information as to why 
raters disagree (see DeCarlo, 2002, for another example). 

An implication of the above for rater training, which was noted earlier (DeCarlo, 2002), 
is that it is probably more effective to monitor raters in terms of their discrimination parameter 
than by their level of agreement. Use of agreement might be unnecessarily strict. For example, it 
might suggest rater retraining or elimination in situations where it is not necessary, in that the 
rater discriminates adequately but has different response criteria. The estimate of d will provide 
valuable information about rater performance in that case, and the use of model-based 
classifications will likely have benefits as well. Thus, the approach via SDT might have cost 
benefits with respect to reducing unnecessary elimination or retraining of raters. Given that the 
above simulations showed that the discrimination parameter was accurately estimated for the 
range of values that appear to be found in practice, the latent-class SDT model should be a useful 
tool for monitoring rater performance. 

It also was shown that adjudicated cases can be included in the analysis. For an analysis 
of essays pooled across raters, discrimination was about the same for adjudicated essays as it was 
for the other essays. This finding makes sense for ETS tests, because the adjudicators are 
sometimes chosen from the general pool of raters, and so they should have similar discrimination 
to other raters. However, in some cases in the psychometrics literature, adjudicators are assumed 
to be experts; note that the latent-class SDT model allows assessment of expertise by using the 
data of all of the raters (or scores), as done above, and comparing the parameters across raters 
(experts should show large values of d and appropriate criteria locations). In the same way, the 
latent-class SDT model allows one to evaluate presumed gold standards used in medical and 
other research, as noted earlier (DeCarlo, 2002). 

Similarities across tests used by ETS were also found. For example, rater similarities 
were found across the essays used in the writing sections of the first and second tests, in that 
discrimination tended to be in the range of 3-4 for the logistic model. Note that this was also 
found for an analysis of a large sample of SAT essays (where ds of 3.5 and 3.1 were found; 
DeCarlo & Kim, 2008). It is also interesting to note that, for an analysis of a small sample (125) 
of college data scored by nonexperts (graduate students), the average value of d was 2.1 (for the 
logistic model; see DeCarlo, 2005), which is smaller than that found for the professional raters 
used in the large-scale assessments examined here (where values of d from 3-4 were found). 
This difference could reflect differences in the raters’ experience or differences in the quality of 
the essay item or the scoring rubric; further research on this is needed. In any case, there are 
clearly interesting patterns of results with respect to d, found both here and in previous studies. 
The latent-class SDT model offers a new perspective with which to examine CR data and 
suggests new directions for future research. 

Notes 


1 Lawrence DeCarlo wrote this paper while under contract to ETS. 

2 It is interesting to note that the deviations from the intersection-point criteria locations appear to 
be in the direction of where optimal criteria would be located (see Wickens, 2002). Because 
Classes 1 and 6 have small sizes, the optimal criteria should be further to the left for c1 and 
further to the right for c5, which is exactly what was found. The estimates are, however, 
further to the left or right than the optimal locations. 

Appendix A 

Parameter Estimates, Bias, Percent Bias, and Mean Squared Error for the Fully Crossed 

Conditions With 10 Raters 


Table A1 

Intersection Point Criteria, Fully Crossed, d = 2, N = 1,080 

Parameter   Value   Estimate   Bias      % Bias   MSE
Rater parameters
d1          2       2.022       0.022    1.100    0.006
d2          2       2.003       0.003    0.150    0.008
d3          2       1.992      -0.008    0.400    0.006
d4          2       2.010       0.010    0.500    0.005
d5          2       2.000       0.000    0.000    0.005
d6          2       2.000       0.000    0.000    0.006
d7          2       2.000       0.000    0.000    0.006
d8          2       2.005       0.005    0.250    0.007
d9          2       2.003       0.003    0.150    0.005
d10         2       1.999      -0.001    0.050    0.006
c11         1       1.006       0.006    0.600    0.031
c12         3       3.036       0.036    1.200    0.030
c13         5       5.049       0.049    0.980    0.057
c14         7       7.065       0.065    0.928    0.075
c15         9       9.088       0.088    0.977    0.103
c21         1       1.009       0.009    0.900    0.031
c22         3       3.014       0.014    0.466    0.033
c23         5       5.022       0.022    0.440    0.046
c24         7       7.020       0.020    0.285    0.088
c25         9       9.041       0.041    0.455    0.131
c31         1       0.981      -0.019    1.900    0.023
c32         3       2.983      -0.016    0.533    0.031
c33         5       4.980      -0.020    0.400    0.041
c34         7       6.977      -0.023    0.328    0.063
c35         9       8.981      -0.019    0.211    0.088
c41         1       1.001       0.001    0.100    0.025
c42         3       3.014       0.014    0.466    0.027
c43         5       5.034       0.034    0.680    0.038
c44         7       7.025       0.025    0.357    0.059
c45         9       9.034       0.034    0.377    0.086
c51         1       0.992      -0.008    0.800    0.022
c52         3       3.013       0.013    0.433    0.028
c53         5       5.004       0.004    0.080    0.045
c54         7       7.014       0.014    0.200    0.071
c55         9       9.038       0.038    0.422    0.092
c61         1       0.998      -0.012    1.200    0.023
c62         3       3.007       0.007    0.233    0.033
c63         5       4.994      -0.006    0.120    0.048
c64         7       7.013       0.013    0.185    0.072
c65         9       9.009       0.009    0.100    0.093
c71         1       0.987      -0.013    1.300    0.023
c72         3       3.013       0.013    0.433    0.029
c73         5       5.011       0.011    0.220    0.052
c74         7       7.016       0.016    0.280    0.083
c75         9       9.023       0.023    0.255    0.113
c81         1       0.999      -0.001    0.100    0.030
c82         3       2.990      -0.010    0.333    0.033
c83         5       4.981      -0.019    0.380    0.045
c84         7       6.979      -0.021    0.300    0.072
c85         9       8.983      -0.017    0.188    0.099
c91         1       0.998      -0.002    0.200    0.025

C92 

3 

3.010 

0.010 

0.333 

0.032 

C93 

5 

5.013 

0.013 

0.260 

0.042 

C 94 

7 

7.002 

0.002 

0.028 

0.071 

C 95 

9 

8.995 

-0.005 

0.055 

0.081 

cm 

1 

1.007 

0.007 

0.700 

0.030 

C 102 

3 

2.992 

-0.008 

0.266 

0.029 

C 103 

5 

4.998 

-0.002 

0.040 

0.050 

Cl04 

7 

7.010 

0.010 

0.142 

0.075 

cm 

9 

9.024 

0.024 

0.266 

0.091 

Latent-class sizes 

Class 1 

0.080 

0.080 

0.000 

0.000 


Class 2 

0.170 

0.169 

0.001 

0.589 


Class 3 

0.250 

0.251 

- 0.001 

0.400 


Class 4 

0.250 

0.250 

0.000 

0.000 


Class 5 

0.170 

0.170 

0.000 

0.000 


Class 6 

0.080 

0.081 

- 0.001 

1.250 

Table A2 

Intersection-Point Criteria, Fully Crossed, d = 3, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          3       3.007      0.007    0.233    0.009 
d2          3       3.024      0.024    0.800    0.011 
d3          3       3.038      0.038    1.267    0.013 
d4          3       3.006      0.006    0.200    0.010 
d5          3       3.017      0.017    0.567    0.012 
d6          3       3.013      0.013    0.433    0.011 
d7          3       3.031      0.031    1.033    0.011 
d8          3       3.009      0.009    0.300    0.009 
d9          3       3.009      0.009    0.300    0.013 
d10         3       3.030      0.030    1.000    0.010 
c11         1.5     1.500      0.000    0.000    0.037 
c12         4.5     4.529      0.029    0.644    0.040 
c13         7.5     7.545      0.045    0.600    0.071 
c14         10.5    10.540     0.040    0.381    0.119 
c15         13.5    13.536     0.036    0.267    0.177 
c21         1.5     1.511      0.011    0.733    0.028 
c22         4.5     4.545      0.045    1.000    0.440 
c23         7.5     7.567      0.067    0.893    0.078 
c24         10.5    10.612     0.112    1.067    0.135 
c25         13.5    13.591     0.091    0.674    0.232 
c31         1.5     1.533      0.033    2.200    0.040 
c32         4.5     4.584      0.084    1.867    0.050 
c33         7.5     7.591      0.091    1.213    0.092 
c34         10.5    10.625     0.125    1.190    0.149 
c35         13.5    13.634     0.134    0.993    0.238 
c41         1.5     1.511      0.011    0.733    0.040 
c42         4.5     4.505      0.005    0.111    0.045 
c43         7.5     7.497     -0.003    0.040    0.076 
c44         10.5    10.509     0.009    0.085    0.122 
c45         13.5    13.532     0.032    0.237    0.204 
c51         1.5     1.537      0.037    2.467    0.033 
c52         4.5     4.524      0.024    0.533    0.036 
c53         7.5     7.554      0.054    0.720    0.079 
c54         10.5    10.573     0.073    0.695    0.124 
c55         13.5    13.579     0.079    0.585    0.222 
c61         1.5     1.502      0.002    0.133    0.025 
c62         4.5     4.505      0.005    0.111    0.040 
c63         7.5     7.534      0.034    0.453    0.092 
c64         10.5    10.555     0.055    0.524    0.155 
c65         13.5    13.549     0.049    0.363    0.204 
c71         1.5     1.556      0.056    3.733    0.034 
c72         4.5     4.567      0.067    1.489    0.045 
c73         7.5     7.567      0.067    0.893    0.080 
c74         10.5    10.613     0.113    1.076    0.143 
c75         13.5    13.630     0.130    0.963    0.211 
c81         1.5     1.484     -0.016    1.067    0.030 
c82         4.5     4.528      0.028    0.622    0.045 
c83         7.5     7.543      0.043    0.573    0.069 
c84         10.5    10.546     0.046    0.438    0.122 
c85         13.5    13.554     0.054    0.400    0.184 
c91         1.5     1.513      0.013    0.866    0.039 
c92         4.5     4.522      0.022    0.489    0.047 
c93         7.5     7.515      0.015    0.200    0.090 
c94         10.5    10.518     0.018    0.171    0.139 
c95         13.5    13.530     0.030    0.222    0.232 
c101        1.5     1.531      0.031    2.067    0.027 
c102        4.5     4.543      0.043    0.955    0.034 
c103        7.5     7.578      0.078    1.040    0.074 
c104        10.5    10.608     0.108    1.028    0.125 
c105        13.5    13.641     0.014    0.104    0.215 
Latent-class sizes 
Class 1     0.080   0.079     -0.001    1.250 
Class 2     0.170   0.173      0.003    1.765 
Class 3     0.250   0.248     -0.002    0.800 
Class 4     0.250   0.251      0.001    0.400 
Class 5     0.170   0.169     -0.001    0.588 
Class 6     0.080   0.080      0.000    0.000 



Table A3 

Intersection-Point Criteria, Fully Crossed, d = 4, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          4       4.009      0.009    0.225    0.017 
d2          4       4.021      0.021    0.525    0.020 
d3          4       4.010      0.010    0.250    0.014 
d4          4       4.008      0.008    0.200    0.015 
d5          4       4.027      0.027    0.675    0.016 
d6          4       4.012      0.012    0.300    0.018 
d7          4       4.018      0.018    0.450    0.018 
d8          4       4.018      0.018    0.450    0.018 
d9          4       4.028      0.028    0.700    0.018 
d10         4       4.007      0.007    0.175    0.018 
c11         2       2.003      0.034    1.700    0.032 
c12         6       6.033      0.033    0.550    0.054 
c13         10      10.042     0.042    0.420    0.125 
c14         14      14.047     0.047    0.335    0.201 
c15         18      18.042     0.042    0.233    0.343 
c21         2       1.997     -0.003    0.150    0.056 
c22         6       6.027      0.027    0.450    0.063 
c23         10      10.062     0.062    0.620    0.154 
c24         14      14.105     0.105    0.750    0.278 
c25         18      18.088     0.088    0.488    0.377 
c31         2       1.986     -0.014    0.300    0.047 
c32         6       6.003      0.003    0.050    0.071 
c33         10      10.056     0.036    0.360    0.104 
c34         14      14.012     0.012    0.085    0.177 
c35         18      18.032     0.032    0.177    0.280 
c41         2       1.975     -0.025    1.250    0.044 
c42         6       6.008      0.008    0.133    0.061 
c43         10      10.026     0.026    0.260    0.114 
c44         14      14.017     0.017    0.121    0.195 
c45         18      18.060     0.060    0.333    0.342 
c51         2       2.009      0.009    0.450    0.041 
c52         6       6.033      0.033    0.550    0.065 
c53         10      10.067     0.067    0.670    0.106 
c54         14      14.090     0.090    0.642    0.227 
c55         18      18.129     0.129    0.716    0.349 
c61         2       1.965     -0.035    1.750    0.041 
c62         6       6.023      0.023    0.383    0.079 
c63         10      10.034     0.034    0.340    0.130 
c64         14      14.049     0.049    0.350    0.213 
c65         18      18.041     0.041    0.227    0.417 
c71         2       1.985     -0.015    0.750    0.051 
c72         6       6.024      0.024    0.400    0.066 
c73         10      10.049     0.049    0.490    0.129 
c74         14      14.046     0.046    0.328    0.228 
c75         18      18.102     0.102    0.566    0.377 
c81         2       2.012      0.012    0.600    0.048 
c82         6       6.017      0.017    0.283    0.076 
c83         10      10.054     0.017    0.170    0.138 
c84         14      14.053     0.017    0.121    0.242 
c85         18      18.080     0.080    0.444    0.387 
c91         2       2.024      0.024    1.200    0.037 
c92         6       6.023      0.023    0.383    0.054 
c93         10      10.083     0.083    0.830    0.136 
c94         14      14.107     0.107    0.764    0.213 
c95         18      18.128     0.128    0.711    0.325 
c101        2       2.011      0.011    0.550    0.044 
c102        6       6.005      0.005    0.083    0.067 
c103        10      10.009     0.009    0.090    0.119 
c104        14      14.006     0.006    0.042    0.252 
c105        18      18.001     0.001    0.005    0.357 
Latent-class sizes 
Class 1     0.080   0.081      0.001    1.250 
Class 2     0.170   0.173      0.003    1.765 
Class 3     0.250   0.250      0.000    0.000 
Class 4     0.250   0.249     -0.001    0.400 
Class 5     0.170   0.169     -0.001    0.588 
Class 6     0.080   0.079     -0.001    1.250 



Table A4 

Intersection-Point Criteria, Fully Crossed, d = 5, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          5       5.182      0.182    3.640    0.081 
d2          5       5.078      0.078    1.560    0.038 
d3          5       5.083      0.083    1.660    0.039 
d4          5       5.112      0.111    2.220    0.049 
d5          5       5.121      0.121    2.420    0.049 
d6          5       5.160      0.160    3.200    0.072 
d7          5       5.132      0.132    2.640    0.066 
d8          5       5.117      0.117    2.340    0.048 
d9          5       5.119      0.119    2.380    0.053 
d10         5       5.099      0.099    1.980    0.030 
c11         2.5     2.874      0.374    14.960   13.727 
c12         7.5     7.285     -0.215    2.866    18.838 
c13         12.5    12.433    -0.067    0.536    20.090 
c14         17.5    17.526     0.026    0.148    20.020 
c15         22.5    22.204    -0.295    1.311    14.479 
c21         2.5     2.738      0.238    9.520    12.378 
c22         7.5     7.127     -0.373    4.973    17.725 
c23         12.5    12.172    -0.328    2.624    18.244 
c24         17.5    17.188    -0.311    1.777    17.150 
c25         22.5    21.766    -0.734    3.262    12.643 
c31         2.5     2.713      0.213    8.520    11.970 
c32         7.5     7.191     -0.309    4.120    17.564 
c33         12.5    12.143    -0.357    2.856    17.881 
c34         17.5    17.196    -0.304    1.737    17.239 
c35         22.5    21.196    -0.708    3.146    12.490 
c41         2.5     2.779      0.279    11.160   12.301 
c42         7.5     7.231     -0.268    3.573    18.071 
c43         12.5    12.228    -0.272    2.176    18.800 
c44         17.5    17.279    -0.224    1.280    18.123 
c45         22.5    21.859    -0.641    2.848    13.419 
c51         2.5     2.745      0.245    9.800    12.461 
c52         7.5     7.211     -0.289    3.853    18.537 
c53         12.5    12.554    -0.238    1.904    18.927 
c54         17.5    17.296    -0.204    1.165    18.262 
c55         22.5    21.963    -0.536    2.382    12.691 
c61         2.5     2.777     -0.277    11.080   12.484 
c62         7.5     7.271     -0.229    3.053    18.279 
c63         12.5    12.393    -0.107    0.826    17.977 
c64         17.5    17.458    -0.042    0.240    17.910 
c65         22.5    22.098    -0.402    1.726    12.967 
c71         2.5     2.797      0.297    11.880   12.937 
c72         7.5     7.216     -0.284    3.786    18.294 
c73         12.5    12.271    -0.228    1.824    19.218 
c74         17.5    17.340    -0.160    0.914    18.552 
c75         22.5    21.951    -0.549    2.440    13.467 
c81         2.5     2.795      0.295    11.800   12.718 
c82         7.5     7.195     -0.305    3.720    17.409 
c83         12.5    12.264    -0.236    1.832    17.837 
c84         17.5    17.322    -0.177    0.931    16.621 
c85         22.5    21.864    -0.636    2.297    12.453 
c91         2.5     2.774      0.274    10.960   12.534 
c92         7.5     7.221     -0.279    3.720    18.044 
c93         12.5    12.238    -0.229    1.832    18.463 
c94         17.5    17.336    -0.163    0.931    18.268 
c95         22.5    21.983    -0.517    2.297    13.056 
c101        2.5     2.803      0.303    12.120   12.448 
c102        7.5     7.215     -0.285    3.800    18.148 
c103        12.5    12.238    -0.261    2.088    18.718 
c104        17.5    17.327    -0.173    0.988    18.601 
c105        22.5    21.749    -0.751    3.333    14.046 
Latent-class sizes 
Class 1     0.080   0.123      0.043    53.750 
Class 2     0.170   0.174      0.004    2.352 
Class 3     0.250   0.229     -0.021    8.400 
Class 4     0.250   0.218     -0.032    12.800 
Class 5     0.170   0.155     -0.015    8.823 
Class 6     0.080   0.101      0.021    26.250 



Table A5 

Shifted Criteria, Fully Crossed, d = 2, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          2       2.008      0.008    0.400    0.007 
d2          2       2.001      0.001    0.050    0.006 
d3          2       1.999     -0.001    0.050    0.008 
d4          2       2.009      0.009    0.450    0.006 
d5          2       1.999     -0.001    0.050    0.008 
d6          2       2.012      0.012    0.600    0.006 
d7          2       2.007      0.007    0.350    0.006 
d8          2       2.002      0.002    0.100    0.006 
d9          2       2.015      0.015    0.750    0.006 
d10         2       2.005      0.005    0.250    0.006 
c11         -1      -1.014    -0.014    1.400    0.042 
c12         1       1.005      0.005    0.500    0.024 
c13         3       3.024      0.024    0.800    0.032 
c14         5       5.021      0.021    0.420    0.048 
c15         7       7.021      0.021    0.300    0.073 
c21         -1      -0.983     0.017    1.700    0.034 
c22         1       1.011      0.011    1.100    0.023 
c23         3       2.993     -0.007    0.233    0.037 
c24         5       5.011      0.011    0.220    0.054 
c25         7       7.014      0.014    0.200    0.080 
c31         0       -0.013    -0.013    —        0.029 
c32         2       1.982     -0.018    0.900    0.033 
c33         4       3.972     -0.028    0.700    0.043 
c34         6       5.978     -0.022    0.367    0.059 
c35         8       7.975     -0.025    0.313    0.084 
c41         0       0.019      0.019    —        0.031 
c42         2       2.031      0.031    1.550    0.029 
c43         4       4.034      0.034    0.850    0.046 
c44         6       6.050      0.050    0.833    0.072 
c45         8       8.058      0.058    0.725    0.115 
c51         1       0.971     -0.029    2.900    0.028 
c52         3       3.013      0.013    0.433    0.030 
c53         5       5.014      0.014    0.280    0.048 
c54         7       6.994     -0.006    0.086    0.070 
c55         9       9.017      0.017    0.189    0.092 
c61         1       1.019      0.019    1.900    0.035 
c62         3       3.034      0.034    1.133    0.038 
c63         5       5.045      0.045    0.900    0.061 
c64         7       7.064      0.064    0.914    0.093 
c65         9       9.055      0.055    0.611    0.126 
c71         2       2.004      0.004    0.200    0.031 
c72         4       4.019      0.019    0.475    0.039 
c73         6       6.018      0.018    0.300    0.065 
c74         8       8.036      0.036    0.450    0.108 
c75         10      10.053     0.053    0.530    0.143 
c81         2       2.007      0.007    0.350    0.030 
c82         4       4.010      0.010    0.250    0.042 
c83         6       6.020      0.020    0.333    0.062 
c84         8       8.018      0.018    0.225    0.097 
c85         10      10.025     0.025    0.250    0.137 
c91         3       3.035      0.035    1.167    0.032 
c92         5       5.037      0.037    0.740    0.047 
c93         7       7.042      0.042    0.600    0.062 
c94         9       9.052      0.052    0.578    0.105 
c95         11      11.068     0.068    0.618    0.168 
c101        3       3.032      0.032    1.067    0.031 
c102        5       5.029      0.029    0.580    0.051 
c103        7       7.021      0.021    0.300    0.073 
c104        9       9.023      0.023    0.256    0.106 
c105        11      11.035     0.035    0.318    0.165 
Latent-class sizes 
Class 1     0.080   0.080      0.000    0.000 
Class 2     0.170   0.169     -0.001    0.588 
Class 3     0.250   0.248     -0.002    0.800 
Class 4     0.250   0.251      0.001    0.400 
Class 5     0.170   0.171      0.001    0.588 
Class 6     0.080   0.081      0.001    1.250 



Table A6 

Shifted Criteria, Fully Crossed, d = 3, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          3       3.008      0.008    0.267    0.014 
d2          3       3.017      0.017    0.567    0.013 
d3          3       3.014      0.014    0.467    0.018 
d4          3       3.020      0.020    0.667    0.011 
d5          3       3.021      0.021    0.700    0.012 
d6          3       2.999     -0.001    0.033    0.010 
d7          3       3.012      0.012    0.400    0.013 
d8          3       2.999     -0.001    0.033    0.011 
d9          3       3.005      0.005    0.167    0.014 
d10         3       3.016      0.016    0.533    0.011 
c11         -0.5    -0.519    -0.019    3.800    0.034 
c12         2.5     2.480     -0.020    0.800    0.034 
c13         5.5     5.509      0.009    0.164    0.070 
c14         8.5     8.513      0.013    0.153    0.142 
c15         11.5    11.525     0.025    0.217    0.211 
c21         -0.5    -0.513    -0.013    2.600    0.041 
c22         2.5     2.501      0.001    0.040    0.035 
c23         5.5     5.545      0.045    0.818    0.065 
c24         8.5     8.543      0.043    0.506    0.123 
c25         11.5    11.570     0.070    0.609    0.192 
c31         0.5     0.509      0.009    1.800    0.035 
c32         3.5     3.518      0.018    0.514    0.046 
c33         6.5     6.547      0.047    0.723    0.098 
c34         9.5     9.548      0.048    0.505    0.192 
c35         12.5    12.570     0.070    0.560    0.299 
c41         0.5     0.505      0.005    1.000    0.035 
c42         3.5     3.514      0.014    0.400    0.039 
c43         6.5     6.515      0.015    0.231    0.064 
c44         9.5     9.546      0.046    0.484    0.109 
c45         12.5    12.578     0.078    0.624    0.162 
c51         1.5     1.494     -0.006    0.400    0.027 
c52         4.5     4.513      0.013    0.289    0.043 
c53         7.5     7.553      0.053    0.707    0.083 
c54         10.5    10.563     0.063    0.600    0.141 
c55         13.5    13.590     0.090    0.667    0.230 
c61         1.5     1.506      0.006    0.400    0.031 
c62         4.5     4.509      0.009    0.200    0.038 
c63         7.5     7.506      0.006    0.080    0.077 
c64         10.5    10.490    -0.010    0.095    0.124 
c65         13.5    13.498    -0.002    0.015    0.208 
c71         2.5     2.528      0.028    1.120    0.033 
c72         5.5     5.533      0.033    0.600    0.067 
c73         8.5     8.535      0.035    0.412    0.118 
c74         11.5    11.552     0.052    0.452    0.177 
c75         14.5    14.536     0.036    0.248    0.284 
c81         2.5     2.494     -0.006    0.240    0.030 
c82         5.5     5.507      0.007    0.127    0.052 
c83         8.5     8.504      0.004    0.047    0.098 
c84         11.5    11.508     0.008    0.070    0.154 
c85         14.5    14.525     0.025    0.172    0.241 
c91         3.5     3.521      0.021    0.600    0.042 
c92         6.5     6.511      0.011    0.169    0.065 
c93         9.5     9.532      0.032    0.337    0.136 
c94         12.5    12.537     0.037    0.296    0.224 
c95         15.5    15.549     0.049    0.316    0.322 
c101        3.5     3.532      0.032    0.914    0.038 
c102        6.5     6.526      0.026    0.400    0.052 
c103        9.5     9.551      0.051    0.537    0.094 
c104        12.5    12.560     0.060    0.480    0.169 
c105        15.5    15.627     0.127    0.819    0.291 
Latent-class sizes 
Class 1     0.080   0.080      0.000    0.000 
Class 2     0.170   0.170      0.000    0.000 
Class 3     0.250   0.250      0.000    0.000 
Class 4     0.250   0.250      0.000    0.000 
Class 5     0.170   0.172      0.002    1.176 
Class 6     0.080   0.079     -0.001    1.250 
Appendix B 

Evaluation of the Estimated Standard Errors for d and the Latent-Class Sizes 
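
The comparison reported in the tables of this appendix can be sketched as follows, under the assumption (inferred from the tabled values) that the Bias column is the mean estimated standard error minus the standard deviation of the estimates across replications. Whether the report used the n - 1 form of the standard deviation is not stated; the sketch uses it. The function name and example values are hypothetical.

```python
from statistics import fmean, stdev


def se_evaluation(estimates, standard_errors):
    """Compare model-based standard errors with the empirical spread.

    estimates: parameter estimates, one per replication
    standard_errors: the estimated SE reported in each replication
    """
    sd = stdev(estimates)              # empirical SD across replications (n - 1)
    mean_se = fmean(standard_errors)   # average model-based SE
    return {"sd": sd, "mean_se": mean_se, "bias": mean_se - sd}


# Hypothetical replications for a single rater's d.
print(se_evaluation([1.9, 2.0, 2.1], [0.11, 0.10, 0.12]))
```

A negative bias in this sense means that the estimated standard errors understate the variability actually observed across replications.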


Table B1 

Intersection-Point Criteria, Fully Crossed, d = 2, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.078   0.080    -0.002 
d2             0.088   0.079    -0.009 
d3             0.075   0.078     0.003 
d4             0.071   0.079     0.008 
d5             0.072   0.079     0.007 
d6             0.075   0.079     0.004 
d7             0.082   0.079    -0.003 
d8             0.076   0.079     0.003 
d9             0.072   0.079     0.007 
d10            0.078   0.079     0.001 
Class Size 1   0.008   0.009     0.001 
Class Size 2   0.013   0.013     0.000 
Class Size 3   0.014   0.015     0.001 
Class Size 4   0.013   0.015     0.002 
Class Size 5   0.013   0.013     0.000 
Class Size 6   0.010   0.009     0.001 


Table B2 

Intersection-Point Criteria, Fully Crossed, d = 3, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.096   0.103     0.007 
d2             0.104   0.103    -0.001 
d3             0.108   0.104    -0.004 
d4             0.099   0.103     0.004 
d5             0.104   0.103    -0.001 
d6             0.108   0.103    -0.005 
d7             0.102   0.103     0.001 
d8             0.096   0.103     0.007 
d9             0.113   0.103    -0.010 
d10            0.098   0.103     0.005 
Class Size 1   0.008   0.008     0.000 
Class Size 2   0.013   0.012    -0.001 
Class Size 3   0.013   0.013     0.000 
Class Size 4   0.015   0.023     0.008 
Class Size 5   0.012   0.012     0.000 
Class Size 6   0.008   0.008     0.000 

Table B3 

Intersection-Point Criteria, Fully Crossed, d = 4, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.131   0.132     0.001 
d2             0.142   0.133    -0.009 
d3             0.119   0.132     0.013 
d4             0.125   0.132     0.007 
d5             0.124   0.133     0.009 
d6             0.135   0.132    -0.003 
d7             0.134   0.133    -0.001 
d8             0.136   0.133    -0.003 
d9             0.131   0.133     0.002 
d10            0.136   0.133    -0.003 
Class Size 1   0.009   0.008    -0.001 
Class Size 2   0.012   0.012     0.000 
Class Size 3   0.015   0.013    -0.002 
Class Size 4   0.012   0.013     0.001 
Class Size 5   0.011   0.011     0.000 
Class Size 6   0.008   0.008     0.000 


Table B4 

Intersection-Point Criteria, Fully Crossed, d = 5, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.219   0.182    -0.037 
d2             0.181   0.177    -0.004 
d3             0.181   0.177    -0.004 
d4             0.193   0.179    -0.014 
d5             0.185   0.179    -0.006 
d6             0.217   0.180    -0.037 
d7             0.221   0.179    -0.042 
d8             0.187   0.179    -0.008 
d9             0.197   0.178    -0.019 
d10            0.142   0.178     0.036 
Class Size 1   0.103   0.011    -0.092 
Class Size 2   0.075   0.012    -0.063 
Class Size 3   0.040   0.013    -0.027 
Class Size 4   0.039   0.012    -0.027 
Class Size 5   0.074   0.012    -0.062 
Class Size 6   0.103   0.010    -0.093 

Table B5 

Shifted Criteria, Fully Crossed, d = 2, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.082   0.084     0.002 
d2             0.080   0.084     0.004 
d3             0.080   0.082     0.002 
d4             0.088   0.082    -0.006 
d5             0.077   0.080     0.003 
d6             0.091   0.081    -0.010 
d7             0.079   0.082     0.003 
d8             0.081   0.084     0.003 
d9             0.077   0.084     0.007 
d10            0.079   0.084     0.005 
Class Size 1   0.010   0.010     0.000 
Class Size 2   0.012   0.013     0.001 
Class Size 3   0.013   0.015     0.002 
Class Size 4   0.016   0.015    -0.001 
Class Size 5   0.012   0.013     0.001 
Class Size 6   0.010   0.010     0.000 


Table B6 

Shifted Criteria, Fully Crossed, d = 3, N = 1,080 

Parameter      SD      Mean SE   Bias 
d1             0.121   0.116    -0.005 
d2             0.114   0.116     0.002 
d3             0.136   0.112    -0.024 
d4             0.104   0.113     0.009 
d5             0.109   0.104    -0.005 
d6             0.101   0.103     0.002 
d7             0.113   0.112    -0.001 
d8             0.104   0.111     0.007 
d9             0.119   0.116    -0.003 
d10            0.105   0.116     0.011 
Class Size 1   0.008   0.008     0.000 
Class Size 2   0.011   0.012     0.001 
Class Size 4   0.014   0.013    -0.001 
Class Size 5   0.012   0.012     0.000 
Class Size 6   0.008   0.008     0.000 
Appendix C 

Parameter Estimates, Bias, Percentage Bias, and Mean Squared Error for 
the Balanced Incomplete Block (BIB) Design 
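
For orientation, the following sketch shows one way a balanced incomplete block assignment of raters to essays could be laid out. It assumes, and this is not confirmed by this section, that each essay is scored by exactly two of the 10 raters and that every pair of raters scores the same number of essays; with 45 possible pairs and N = 1,080 that would give 24 essays per pair. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def bib_pair_assignment(n_raters=10, n_essays=1080):
    """Assign each essay to a pair of raters so that all pairs are used equally."""
    pairs = list(combinations(range(1, n_raters + 1), 2))
    if n_essays % len(pairs) != 0:
        raise ValueError("n_essays must be a multiple of the number of rater pairs")
    essays_per_pair = n_essays // len(pairs)
    # One entry per essay: the (rater, rater) pair that scores it.
    return [pair for pair in pairs for _ in range(essays_per_pair)]


design = bib_pair_assignment()
load = Counter(rater for pair in design for rater in pair)
print(len(design), load[1])
```

Under these assumptions each rater scores only a fraction of the essays rather than all of them, so far fewer ratings are available per parameter than in the fully crossed design.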


Table C1 

Intersection-Point Criteria, Balanced Incomplete Block, d = 2, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          2       1.813     -0.187    9.350    0.154 
d2          2       1.786     -0.214    10.700   0.164 
d3          2       1.799     -0.201    10.050   0.182 
d4          2       1.798     -0.202    10.100   0.157 
d5          2       1.839     -0.161    8.050    0.150 
d6          2       1.804     -0.196    9.800    0.144 
d7          2       1.793     -0.207    10.350   0.117 
d8          2       1.772     -0.228    11.400   0.149 
d9          2       1.824     -0.176    8.800    0.162 
d10         2       1.784     -0.216    10.800   0.148 
c11         1       1.018     -0.634    63.400   0.692 
c12         3       2.464     -0.536    17.866   0.819 
c13         5       4.493     -0.506    10.120   1.200 
c14         7       6.524     -0.476    6.800    1.776 
c15         9       8.613     -0.387    4.300    2.237 
c21         1       1.004     -0.606    60.600   0.603 
c22         3       2.464     -0.536    17.866   0.767 
c23         5       4.484     -0.516    10.320   1.143 
c24         7       6.482     -0.518    7.400    1.581 
c25         9       8.502     -0.498    5.533    2.273 
c31         1       1.023     -0.667    66.700   0.766 
c32         3       2.394     -0.606    20.200   1.002 
c33         5       4.457     -0.543    10.860   1.375 
c34         7       6.499     -0.501    7.157    2.095 
c35         9       8.568     -0.432    4.480    2.760 
c41         1       0.986     -0.642    64.200   0.646 
c42         3       2.406     -0.594    19.800   0.835 
c43         5       4.435     -0.565    11.300   1.122 
c44         7       6.516     -0.484    6.914    1.599 
c45         9       8.596     -0.403    4.477    2.213 
c51         1       0.348     -0.652    65.200   0.713 
c52         3       2.481     -0.519    17.300   0.905 
c53         5       4.576     -0.424    8.480    1.193 
c54         7       6.630     -0.369    5.271    1.769 
c55         9       8.745     -0.254    2.822    2.275 
c61         1       0.360     -0.639    63.900   0.655 
c62         3       2.456     -0.543    18.100   0.726 
c63         5       4.493     -0.506    10.120   1.030 
c64         7       6.462     -0.537    7.671    1.428 
c65         9       8.531     -0.468    5.520    2.002 
c71         1       0.384     -0.615    61.500   0.590 
c72         3       2.432     -0.567    18.900   0.719 
c73         5       4.468     -0.532    10.640   0.879 
c74         7       6.549     -0.450    6.428    1.275 
c75         9       8.568     -0.432    4.800    1.670 
c81         1       0.339     -0.661    66.100   0.719 
c82         3       2.384     -0.616    20.533   0.921 
c83         5       4.415     -0.585    11.700   1.116 
c84         7       6.456     -0.544    7.771    1.441 
c85         9       8.499     -0.501    5.566    2.022 
c91         1       0.439     -0.561    56.100   0.633 
c92         3       2.533     -0.466    15.533   0.835 
c93         5       4.570     -0.430    8.600    1.195 
c94         7       6.610     -0.390    5.571    1.654 
c95         9       8.693     -0.307    3.411    2.441 
c101        1       0.360     -0.640    64.000   0.654 
c102        3       2.415     -0.585    19.500   0.730 
c103        5       4.430     -0.570    11.400   1.001 
c104        7       6.455     -0.545    7.875    1.328 
c105        9       8.489     -0.511    5.677    2.069 
Latent-class sizes 
Class 1     0.080   0.146      0.066    82.500 
Class 2     0.170   0.120     -0.050    29.412 
Class 3     0.250   0.240     -0.010    4.000 
Class 4     0.250   0.231     -0.019    7.600 
Class 5     0.170   0.130     -0.040    23.529 
Class 6     0.080   0.133      0.053    66.250 

Table C2 

Intersection-Point Criteria, Balanced Incomplete Block, d = 3, N = 1,080 

Parameter   Value   Estimate   Bias     % Bias   MSE 
Rater parameters 
d1          3       2.862     -0.137    4.567    0.231 
d2          3       2.885     -0.115    3.833    0.206 
d3          3       2.773     -0.226    7.533    0.246 
d4          3       2.857     -0.143    4.767    0.220 
d5          3       3.006      0.006    0.200    0.272 
d6          3       2.895     -0.105    3.500    0.223 
d7          3       2.908     -0.009    0.300    0.279 
d8          3       2.938     -0.006    0.200    0.209 
d9          3       2.892     -0.107    3.567    0.187 
d10         3       2.895     -0.148    4.933    0.206 
c11         1.5     0.888     -0.612    40.800   0.755 
c12         4.5     4.057     -0.443    9.844    0.886 
c13         7.5     7.117     -0.383    5.107    1.460 
c14         10.5    10.250    -0.250    2.381    2.636 
c15         13.5    13.353    -0.147    1.089    4.337 
c21         1.5     0.954     -0.546    36.400   0.598 
c22         4.5     4.059     -0.441    9.800    0.872 
c23         7.5     7.164     -0.336    4.480    1.415 
c24         10.5    10.340    -0.160    1.524    2.494 
c25         13.5    13.432    -0.068    0.503    3.829 
c31         1.5     0.879     -0.621    41.400   0.716 
c32         4.5     3.856     -0.644    14.311   0.970 
c33         7.5     6.906     -0.594    7.920    1.780 
c34         10.5    9.845     -0.655    6.238    2.578 
c35         13.5    12.975    -0.525    3.889    4.125 
c41         1.5     0.887     -0.613    40.867   0.681 
c42         4.5     4.060     -0.440    9.778    0.744 
c43         7.5     7.176     -0.324    4.320    1.478 
c44         10.5    10.206    -0.294    2.800    2.497 
c45         13.5    13.335    -0.165    1.222    3.745 
c51         1.5     1.007     -0.493    32.867   0.583 
c52         4.5     4.261     -0.239    5.311    0.956 
c53         7.5     7.451     -0.048    0.640    2.138 
c54         10.5    10.675     0.175    1.667    3.455 
c55         13.5    13.938     0.438    3.244    5.352 
c61         1.5     0.914     -0.586    39.067   0.643 
c62         4.5     4.038     -0.462    10.267   0.894 
c63         7.5     7.227     -0.273    3.640    1.530 
c64         10.5    10.296    -0.203    1.933    2.704 
c65         13.5    13.547     0.047    0.348    4.229 
c71         1.5     0.907     -0.056    3.733    0.603 
c72         4.5     3.995     -0.505    11.222   1.044 
c73         7.5     7.223     -0.277    3.693    1.978 
c74         10.5    10.318    -0.182    1.733    3.099 
c75         13.5    13.546    -0.046    0.341    5.751 
c81         1.5     0.994     -0.506    33.733   0.555 
c82         4.5     4.163     -0.337    7.489    0.736 
c83         7.5     7.355     -0.145    1.933    1.568 
c84         10.5    10.431    -0.069    0.657    2.449 
c85         13.5    13.687     0.187    1.385    4.363 
c91         1.5     0.994     -0.506    33.733   0.546 
c92         4.5     4.133     -0.366    8.133    0.669 
c93         7.5     7.176     -0.324    4.320    1.246 
c94         10.5    10.292    -0.208    1.981    1.932 
c95         13.5    13.456    -0.044    0.326    3.134 
c101        1.5     0.914     -0.586    39.067   0.559 
c102        4.5     4.076     -0.042    0.933    0.745 
c103        7.5     7.100     -0.399    5.320    1.381 
c104        10.5    10.171    -0.329    3.133    2.377 
c105        13.5    13.278    -0.222    1.644    3.870 
Latent-class sizes 
Class 1     0.080   0.109      0.029    36.250 
Class 2     0.170   0.158     -0.012    7.060 
Class 3     0.250   0.235     -0.015    6.000 
Class 4     0.250   0.236     -0.014    5.600 
Class 5     0.170   0.155     -0.015    8.824 
Class 6     0.080   0.106      0.026    32.500 


Table C3 

Intersection-Point Criteria, Balanced Incomplete Block, d = 4, N = 1,080 

Parameter        Value  Estimate      Bias    % Bias       MSE

Rater parameters
d1                   4     4.011     0.011     0.375     0.017
d2                   4     3.918    -0.082     2.050     0.291
d3                   4     3.924    -0.076     1.900     0.356
d4                   4     4.052     0.052     1.300     0.265
d5                   4     3.969    -0.031     0.775     0.275
d6                   4     3.943    -0.057     1.425     0.282
d7                   4     3.990    -0.010     0.250     0.305
d8                   4     4.003     0.003     0.075     0.304
d9                   4     3.918    -0.082     2.050     0.235
d10                  4     3.910    -0.090     2.250     0.267
c11                  2     1.529    -0.471    23.550     0.639
c12                  6     5.868    -0.132     2.200     1.234
c13                 10    10.015     0.015     0.150     2.331
c14                 14    14.189     0.189     1.350     3.759
c15                 18    18.452     0.452     2.511     6.089
c21                  2     1.536    -0.046     2.300     0.687
c22                  6     5.709    -0.290     4.833     1.091
c23                 10     9.789    -0.211     2.110     1.980
c24                 14    13.811    -0.189     1.350     3.602
c25                 18    18.056     0.056     0.311     6.032
c31                  2     1.510    -0.490    24.500     0.576
c32                  6     5.715    -0.284     4.733     1.122
c33                 10     9.765    -0.234     2.340     2.232
c34                 14    13.824    -0.176     1.257     4.032
c35                 18    18.100    -0.099     0.550     7.133
c41                  2     1.645    -0.355    17.750     0.623
c42                  6     5.889    -0.111     1.850     0.953
c43                 10    10.129     0.129     1.290     1.890
c44                 14    14.340     0.340     2.429     3.516
c45                 18    18.629     0.629     3.494     6.074
c51                  2     1.645    -0.492    24.600     0.655
c52                  6     5.813    -0.187     3.117     0.953
c53                 10     9.951    -0.049     0.490     1.924
c54                 14    14.107     0.107     0.764     3.541
c55                 18    18.256     0.256     1.422     5.612
c61                  2     1.511    -0.489    24.450     0.724
c62                  6     5.713    -0.286     4.767     1.059
c63                 10     9.860    -0.140     1.400     2.131
c64                 14    14.015     0.015     0.107     3.676
c65                 18    18.217     0.217     1.206     5.969
c71                  2     1.538    -0.462    23.100     0.679
c72                  6     5.682    -0.318     5.300     0.931
c73                 10     9.943    -0.057     0.570     1.770
c74                 14    14.066     0.066     0.471     3.602
c75                 18    18.289     0.289     1.606     5.675
c81                  2     1.594    -0.406    20.300     0.617
c82                  6     5.849    -0.151     2.517     1.103
c83                 10    10.037     0.037     0.370     2.067
c84                 14    14.227     0.227     1.621     3.900
c85                 18    18.364     0.364     2.022     5.797
c91                  2     1.511    -0.489    24.450     0.600
c92                  6     5.698    -0.302     5.033     0.844
c93                 10     9.802    -0.198     1.980     1.796
c94                 14    13.865    -0.135     0.964     3.060
c95                 18    18.020     0.020     0.111     4.817
c10,1                2     1.472    -0.528    26.400     0.626
c10,2                6     5.674    -0.326     5.433     0.911
c10,3               10     9.792    -0.208     2.080     1.764
c10,4               14    13.850    -0.150     1.071     3.129
c10,5               18    17.993    -0.007     0.039     5.073

Latent-class sizes
Class 1          0.080     0.094     0.014    17.500
Class 2          0.170     0.164    -0.006     3.529
Class 3          0.250     0.240    -0.010     4.000
Class 4          0.250     0.245    -0.005     2.000
Class 5          0.170     0.163    -0.007     4.118
Class 6          0.080     0.094     0.014    17.500
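
The Bias and % Bias columns in these tables appear to follow from the Value and Estimate columns as bias = estimate - value and % bias = 100 x |bias| / |value|. The short Python sketch below checks that reading against three rows of Table C3. The function name bias_summary is hypothetical, and the formulas are inferred from the tabled numbers rather than quoted from the report, so small rounding differences against the printed columns are to be expected.

```python
def bias_summary(true_value, mean_estimate):
    """Return (bias, percent_bias) for one parameter.

    Assumed definitions (inferred from the tables, not stated in them):
      bias         = mean estimate - true value
      percent bias = 100 * |bias| / |true value|
    """
    bias = mean_estimate - true_value
    percent_bias = 100.0 * abs(bias) / abs(true_value)
    return bias, percent_bias


# (label, true value, mean estimate) copied from Table C3
rows = [
    ("d2", 4.0, 3.918),
    ("c11", 2.0, 1.529),
    ("Class 1", 0.080, 0.094),
]

for label, value, estimate in rows:
    bias, pct = bias_summary(value, estimate)
    print(f"{label:8s} bias = {bias:+.3f}   % bias = {pct:.3f}")
```

Taking the absolute value in the percentage matches the tables, where % Bias is reported as a positive number even when the bias itself is negative.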



Table C4 

Shifted Criteria, Balanced Incomplete Block, d = 3, N = 1,080 

Parameter        Value  Estimate      Bias    % Bias       MSE

Rater parameters
d1                   3     2.750    -0.250     8.333     0.240
d2                   3     2.688    -0.312    10.400     0.270
d3                   3     2.811    -0.189     6.300     0.225
d4                   3     2.753    -0.247     8.233     0.232
d5                   3     2.857    -0.143     4.767     0.260
d6                   3     2.833    -0.167     5.567     0.196
d7                   3     2.798    -0.202     6.733     0.214
d8                   3     2.800    -0.200     6.667     0.242
d9                   3     2.691    -0.309    10.300     0.281
d10                  3     2.787    -0.213     7.100     0.258
c11               -0.5    -0.861    -0.361    72.200     0.458
c12                2.5     1.916    -0.584    23.360     1.028
c13                5.5     4.858    -0.642    11.673     1.433
c14                8.5     7.900    -0.600     7.059     2.094
c15               11.5    10.940    -0.560     4.870     3.002
c21               -0.5    -1.017    -0.517   103.400     0.749
c22                2.5     1.818    -0.682    27.280     1.181
c23                5.5     4.866    -0.634    11.527     1.506
c24                8.5     7.773    -0.727     8.553     2.415
c25               11.5    10.817    -0.683     5.939     3.151
c31                0.5     0.005    -0.495    99.000     0.566
c32                3.5     2.975    -0.525    15.000     0.743
c33                6.5     6.142    -0.358     5.508     1.256
c34                9.5     9.129    -0.371     3.905     2.128
c35               12.5    12.181    -0.319     2.552     3.152
c41                0.5    -0.011    -0.511   102.200     0.684
c42                3.5     2.938    -0.562    16.057     1.031
c43                6.5     5.930    -0.570     8.769     1.505
c44                9.5     8.928    -0.572     6.021     2.473
c45               12.5    12.035    -0.465     3.720     3.298
c51                1.5     1.040    -0.460    30.667     0.792
c52                4.5     4.068    -0.432     9.600     1.289
c53                7.5     7.198    -0.302     4.027     1.845
c54               10.5    10.311    -0.189     1.800     3.082
c55               13.5    13.348    -0.152     1.126     4.606
c61                1.5     0.902    -0.598    39.867     0.771
c62                4.5     4.066    -0.434     9.644     1.005
c63                7.5     7.134    -0.366     4.880     1.349
c64               10.5    10.238    -0.262     2.495     2.226
c65               13.5    13.270    -0.230     1.704     3.067
c71                2.5     1.984    -0.516    20.640     0.778
c72                5.5     5.066    -0.434     7.891     1.130
c73                8.5     8.032    -0.468     5.506     1.680
c74               11.5    11.169    -0.331     2.878     2.500
c75               14.5    14.042    -0.458     3.159     3.555
c81                2.5     1.903    -0.597    23.880     1.071
c82                5.5     5.508     0.008     0.145     1.225
c83                8.5     8.036    -0.464     5.459     2.157
c84               11.5    11.165    -0.335     2.913     3.374
c85               14.5    14.124    -0.376     2.593     4.423
c91                3.5     2.855    -0.645    18.429     1.221
c92                6.5     5.796    -0.704    10.831     1.889
c93                9.5     8.750    -0.750     7.895     2.714
c94               12.5    11.788    -0.712     5.696     3.746
c95               15.5    14.546    -0.954     6.155     5.092
c10,1              3.5     2.951    -0.549    15.686     1.313
c10,2              6.5     5.990    -0.510     7.846     1.674
c10,3              9.5     9.046    -0.454     4.779     2.446
c10,4             12.5    12.143    -0.357     2.856     3.458
c10,5             15.5    14.986    -0.514     3.316     4.925

Latent-class sizes
Class 1          0.080     0.110     0.030    37.500
Class 2          0.170     0.152    -0.018    10.588
Class 3          0.250     0.235    -0.015     6.000
Class 4          0.250     0.230    -0.020     8.000
Class 5          0.170     0.158    -0.012     7.059
Class 6          0.080     0.115     0.035    43.750
Appendix D 

Evaluation of the Estimated Standard Errors for d and the Latent Class Sizes, Balanced Incomplete Block (BIB) Design 


Table D1 

Intersection-Point Criteria, Balanced Incomplete Block, d = 2, N = 1,080 

Parameter             SD   Mean SE      Bias
d1                 0.347     0.334    -0.013
d2                 0.346     0.324    -0.022
d3                 0.378     0.326    -0.052
d4                 0.342     0.326    -0.016
d5                 0.353     0.336    -0.017
d6                 0.327     0.327     0.000
d7                 0.273     0.322     0.049
d8                 0.314     0.318     0.004
d9                 0.364     0.331    -0.033
d10                0.320     0.323     0.003
Class Size 1       0.037     0.040     0.003
Class Size 2       0.050     0.049    -0.001
Class Size 3       0.057     0.055    -0.002
Class Size 4       0.060     0.055    -0.005
Class Size 5       0.049     0.046    -0.003
Class Size 6       0.030     0.039     0.009
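
The Bias column in Table D1 is consistent with the mean estimated standard error minus the standard deviation of the parameter estimates across replications (for example, 0.334 - 0.347 = -0.013 for d1). A minimal Python sketch of that comparison follows. It assumes the sample standard deviation is the one used, which the tables do not state; se_bias is a hypothetical helper, and the replicate values are made up for illustration and are not data from the report.

```python
import statistics


def se_bias(estimates, standard_errors):
    """Compare estimated standard errors with the empirical spread.

    estimates        -- one parameter estimate per simulated replication
    standard_errors  -- the estimated SE reported in each replication
    Returns (sd, mean_se, bias) with bias = mean_se - sd (assumed definition).
    """
    sd = statistics.stdev(estimates)             # empirical SD across replications
    mean_se = statistics.fmean(standard_errors)  # average of the estimated SEs
    return sd, mean_se, mean_se - sd


# Hypothetical replicate output for a single parameter (illustration only).
estimates = [1.8, 2.1, 2.0, 2.3, 1.9]
standard_errors = [0.20, 0.21, 0.19, 0.20, 0.20]

sd, mean_se, bias = se_bias(estimates, standard_errors)
print(f"SD = {sd:.3f}, mean SE = {mean_se:.3f}, bias = {bias:+.3f}")
```

Under this reading, a negative entry means the estimated standard errors understate the actual sampling variability of the estimates, and a positive entry means they overstate it.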


Table D2 

Intersection-Point Criteria, Balanced Incomplete Block, d = 3, N = 1,080 

Parameter             SD   Mean SE      Bias
d1                 0.463     0.445    -0.018
d2                 0.442     0.447     0.005
d3                 0.443     0.424    -0.019
d4                 0.450     0.444    -0.006
d5                 0.524     0.481    -0.043
d6                 0.463     0.449    -0.014
d7                 0.523     0.454    -0.069
d8                 0.455     0.460     0.005
d9                 0.421     0.449     0.028
d10                0.431     0.437     0.006
Class Size 1       0.018     0.021     0.003
Class Size 2       0.027     0.025    -0.002
Class Size 3       0.029     0.029     0.000
Class Size 4       0.029     0.029     0.000
Class Size 5       0.027     0.025    -0.002
Class Size 6       0.020     0.021     0.001
Table D3 

Intersection-Point Criteria, Balanced Incomplete Block, d = 4, N = 1,080 

Parameter             SD   Mean SE      Bias
d1                 0.559     0.573     0.014
d2                 0.535     0.547     0.012
d3                 0.595     0.546    -0.049
d4                 0.515     0.573     0.058
d5                 0.526     0.558     0.032
d6                 0.531     0.552     0.021
d7                 0.555     0.556     0.001
d8                 0.554     0.562     0.008
d9                 0.480     0.542     0.062
d10                0.512     0.543     0.031
Class Size 1       0.015     0.014    -0.001
Class Size 2       0.021     0.019    -0.002
Class Size 3       0.024     0.023    -0.001
Class Size 4       0.020     0.023     0.003
Class Size 5       0.019     0.020     0.001
Class Size 6       0.014     0.015     0.001


Table D4 

Shifted Criteria, BIB, d = 3, N = 1,080 

Parameter             SD   Mean SE      Bias
d1                 0.423     0.482     0.059
d2                 0.418     0.463     0.045
d3                 0.437     0.498     0.024
d4                 0.416     0.474     0.058
d5                 0.492     0.478    -0.014
d6                 0.413     0.475     0.062
d7                 0.419     0.485     0.066
d8                 0.452     0.489     0.037
d9                 0.432     0.467     0.035
d10                0.464     0.498     0.434
Class Size 1       0.025     0.026     0.001
Class Size 2       0.028     0.027    -0.001
Class Size 4       0.032     0.029    -0.003
Class Size 5       0.031     0.029    -0.002
Class Size 6       0.031     0.027    -0.004